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ABSTRACT 

This  paper  implements  and  analyzes  a  dual-primal  iterative  substructuring  method,  that 
is  parallel  and  scalable,  for  the  solution  of  a  three-dimensional  finite  element  dynamic  analysis 
of  helicopter  rotor  blades.  The  finite  element  analysis  is  developed  using  isoparametric  hexa- 
hedral  brick  elements.  Particular  emphasis  is  placed  on  the  formulation  of  the  inertial  terms 
that  are  unique  to  rotary  wing  structures.  The  scalability  of  the  solution  procedure  is  stud¬ 
ied  using  two  prototype  problems  —  one  for  steady  hover  (symmetric)  and  one  for  transient 
forward  flight  (non-symmetric)  both  carried  out  on  up  to  48  processors.  In  both  hover  and 
forward  flight,  a  linear  speed-up  is  observed  with  number  of  processors,  up  to  the  point  of 
substructure  optimality.  Substructure  optimality  and  the  range  of  linear  speed-up  are  shown 
to  depend  both  on  the  problem  size  as  well  as  a  corner  based  global  coarse  problem  selection. 

An  increase  in  problem  size  extends  the  linear  speed-up  range  up  to  the  new  substructure 
optimality.  A  superior  coarse  problem  selection  extends  the  optimality  to  a  higher  number  of 
processors.  The  method  also  scales  with  respect  to  problem  size.  The  key  conclusion  is  that 
a  three-dimensional  finite  element  analysis  of  a  rotor  can  be  carried  out  in  a  fully  parallel 
and  scalable  manner.  The  careful  selection  of  substructure  corner  nodes,  that  are  used  to 
construct  the  global  coarse  problem,  is  the  key  to  extending  linear  speed-up  to  as  high  a 
processor  number  as  possible,  thus  minimizing  the  solution  time  for  a  given  problem  size. 

< Tij  =  second  Piola-Kirchhoff  stresses 

6  =  instantaneous  control  angle 

u),u>  =  skew  symmetric  angular  velocity  and  accelera¬ 

tions 

£ij  =  linear  strains 

r  =  displacement  of  a  point  relative  to  inertial 

frame 

xBI ,  xBI  =  displacement  and  velocity  of  frame  B  relative 
to  frame  / 

77,  =  finite  element  curvilinear,  natural,  coordinate 

axes 

a  =  subscript  denoting  finite  element  nodal  quan¬ 
tities 

Br/e/c  =  subdomain  Boolean  restrictions 
BI / B  =  quantity  in  B  relative  to  /  measured  in  B 
C  =  Cauchy-Green  deformation  tensor 

Presented  at  the  American  Helicopter  Society  Annual  Forum 
65,  Grapevine,  Texas,  May  27-29,  2009  CBI  =  orientation  of  frame  I  relative  to  B 


Nomenclature 

A  =  denotes  incremental  quantities 

S  =  denotes  variational  quantities 

etj  =  Green-Lagrange  strains 

if  =  subdomain  interface  fluxes 

k  =  condition  number 

Kij  =  non-linear  strains 

A  =  interface  dual  variables 

Cl  =  rotor  rotational  speed 

ip  =  blade  azimuth  angle 

ij)BI  =  rotation  of  frame  B  relative  to  I 
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D,D'  =  linear  constitutive  relations 

fs  =  subdomain  force  vector 

/£  =  coarse  problem  force  vector 

Ha  =  finite  element  shape  function  for  node  a 

I,B  =  inertial  and  non-inertial  body  fixed  frames 

I3  =  unit  matrix  3x3 

J  =  Jacobian  of  transformation  from  physical  to 
finite  element  natural  coordinate  axes 

Ks  =  subdomain  stiffness  matrices 

KqC  =  coarse  problem  stiffness  matrix 

L  =  composite  ply  angle  transformation 

M  =  interface  preconditioner 

q  =  finite  element  degrees  of  freedom 

S  =  discretized  substructure  interface 

s  =  superscript  denoting  substructure  quantities 

Tm/c/k/f  =  inertial  mass,  damping,  stiffness,  and  force 
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INTRODUCTION 

One  hundred  years  ago  what  is  now  called  the 
Poincare- Steklov  operator  was  introduced.  This  oper¬ 
ator,  which  governs  the  interface  of  a  problem  generated 
by  decomposing  a  larger  problem  into  many  smaller  sub¬ 
problems,  has  spectral  properties  that  are  superior  to 
the  original  problem.  In  particular,  its  condition  number 
(ratio  of  maximum  to  minimum  eigenvalues,  which  de¬ 
termines  the  number  of  iterations  needed  for  an  iterative 
solution)  grows  at  a  rate  that  is  an  order  lower  than  that 
of  the  original  problem.  Every  modern  method  of  itera¬ 
tive  substructuring  is  based  on  one  or  more  variational 
forms  of  this  operator. 

The  word  substructuring  and  the  method  was  intro¬ 
duced  fifty  years  ago  by  Przemieniecki  [1,  2].  Together 
with  Denke  [3],  Argyris  and  Kelsey  [4],  and  Turner  et 
al.  [5],  they  laid  the  foundations  of  displacement  and  force 
finite  element  analyses  of  partitioned  structures.  These 
partitioned  methods  were  the  only  avenues  to  obtain  re¬ 
sults  for  practical  structures  for  which  the  original  prob¬ 
lem  far  exceeded  the  capacity  of  computers  at  the  time. 
The  method  of  substructures  has  remained  the  fastest 
(time),  most  efficient  (memory),  most  reliable  (accurate), 


and  most  flexible  (heterogeneous  physics  and  properties) 
method  of  partitioned  analysis  of  large  scale  structures. 
The  modern  methods  of  primal  and  dual  iterative  sub¬ 
structuring  have  their  origin  and  genesis  in  these  original 
contributions  -  established  long  before  the  advent  of  par¬ 
allel  computers. 

The  advent  of  parallel  computers  opened  opportu¬ 
nity  for  solving  each  partitioned  substructure  using  a 
separate  processor.  A  straight  forward  implementation, 
however,  leads  to  a  dead  end.  First,  the  high  condition 
number  and  lack  of  sparsity  of  the  interface  equation  by 
itself  poses  a  significant  challenge  for  convergence.  Sec¬ 
ond,  the  condition  number  grows  rapidly  both  with  prob¬ 
lem  size  and  with  number  of  partitions,  producing  a  dra¬ 
matic  increase  in  the  required  number  of  iterations.  The 
recognition,  that  a  finite  element  representation  of  the 
substructure  interface  is  precisely  a  discrete  equivalent  of 
the  original  Poincare-Steklov  operator,  allows  the  math¬ 
ematical  theories  of  domain  decomposition  to  be  brought 
to  bear  directly  towards  the  resolution  of  this  key  prob¬ 
lem.  Today,  methods  are  available  that  precondition  the 
interface  and  solve  it  iteratively,  in  a  parallel  and  scal¬ 
able  manner,  requiring  only  local  substructure  calcula¬ 
tions.  These  methods  are  called  iterative  substructuring 
methods.  Their  objective  is  to  provide  optimal  numerical 
scalability,  i.e.  to  ensure  that  the  condition  number  of  the 
pre-conditioned  interface  does  not  grow  with  the  number 
of  substructures  and  grows  at  most  poly-logarithmically 
with  mesh  refinement  within  each  substructure. 

The  mathematical  theory  of  domain  decomposition 
provides  scalable  algorithms  for  two  broad  classes  of  par¬ 
titioning:  overlapping  and  non-overlapping  [6].  Over¬ 
lapping  partitioning  leads  to  additive  Schwarz  methods, 
which  are  variants  of  block  Jacobi  preconditioners.  These 
are  widely  used  in  fluid  mechanics,  but  are  of  little  im¬ 
portance  in  structural  mechanics  -  due  to  the  very  high 
condition  numbers  (108  —  1010)  and  high  bandwidth  of 
practical  structures.  Structural  mechanics  prefer  non¬ 
overlapping  partitioning.  They  lead  to  iterative  substruc¬ 
turing  methods,  a  name  borrowed  from,  and  indicative  of 
the  deep  connections  to  the  long  and  successful  tradition 
of  substructuring.  Henceforth,  the  words  ‘subdomain’ 
and  ‘substructure’  are  used  synonymously. 

The  growth  of  the  mathematical  theory  of  itera¬ 
tive  substructuring  can  be  traced  to  the  seminal  work 
of  Agoshkov  et  al.  [7,  8]  and  Bramble  et  al.  [9]  in  the 
mid-1980s.  The  former  provided  a  detailed  analysis  of 
the  Poincare-Steklov  operator.  The  latter  provided  one 
of  the  earliest  scalable  interface  preconditioners  for  prob¬ 
lems  governed  by  2nd  order  elliptic  partial  differential 
equations  (henceforth,  mentioned  as  2nd  order  problems) 
with  homogeneous  coefficients.  Subsequent  algorithmic 
developments  in  this  area  have  continued  through  the 
1990s  and  2000s  (the  interested  reader  is  referred  to 
monographs  by  Quarteroni  and  Valli  [10]  and  Toselli 
and  Widluncl  [11])  culminating  in  the  increasing  appli¬ 
cation  of  these  methods  for  High  Performance  Comput¬ 
ing  (HPC)  based  large  scale  problems  of  computational 
mechanics  (see  special  eds.  [12,  13]).  Today,  Neumann- 


2 


Neumann  type  primal  methods  known  as  Balancing  Do¬ 
main  Decomposition  with  Constraints  (BDDC)  (see  [14] 
and  references  therein)  and  the  Dirichlet-Dirichlet  type 
dual  methods  known  as  Finite  Element  Tearing  and  In¬ 
terconnecting  (FETI)  methods  (see  [15]  and  references 
therein)  provide  scalable  preconditioners  that  are  opti¬ 
mal  up  to  4th  order  elliptic  problems  that  include  highly 
discontinuous  and  heterogeneous  subdomains. 

Both  the  Neumann-Neumann  and  FETI  methods 
act  on  the  discrete  equivalents  of  the  Poincare-Steklov 
interface  operator.  The  former  acts  on  its  primal  form. 
The  latter  acts  on  its  dual  form.  A  primal  form  consists 
of  variables  that  are  a  direct  subset  of  the  original  un¬ 
knowns,  e.g.,  the  interface  displacements.  A  dual  form 
consists  of  a  different  set  of  variables  that  are  not  a  sub¬ 
set  of  the  original  unknowns,  but  whose  equality  must 
still  be  guaranteed  across  the  interface,  e.g.,  the  reac¬ 
tion  forces.  Both  methods  are  based  on  simultaneous 
Dirichlet  and  Neumann  solves  within  each  substructure 
one  for  preconditioning  and  one  for  residual  calcula¬ 
tion.  Note  that  the  Neumann  solves  are  non-invertible 
for  floating  substructures  that  arise  naturally  from  mul¬ 
tiple  partitioning  of  a  structure.  In  Neumann-Neumann, 
this  singularity  occurs  in  the  preconditioner,  in  FETI  this 
singularity  occurs  in  the  residual  calculation. 

The  first  objective  of  this  paper  is  to  apply  an  ad¬ 
vanced  multi-level  FETI  method,  the  FETI-Dual  Primal 
(DP)  method,  pioneered  by  Farhat  et  al.  [15,  16],  for  the 
parallel  solution  of  a  large  scale  rotary  wing  structural 
dynamics  problem.  The  FETI-DP  method  acts  both  on 
the  primal  and  dual  form  of  the  interface  -  each  form  for  a 
separate  subset  of  the  interface.  We  apply  this  method  in 
the  present  work  because  there  is  a  substantial  volume  of 
published  literature  demonstrating  its  high  level  of  per¬ 
formance  for  large  scale  engineering  problems  (see  refer¬ 
ences  above).  Note,  however,  that  both  the  FETI-DP 
and  the  BDDC  methods  are  closely  connected,  and  we 
refer  the  interested  reader  to  the  recent  work  of  Mandel 
et  al.  [17]  for  an  excellent  exposition  of  this  connection. 

The  state-of-the-art  in  rotary  wing  structural  mod¬ 
eling  involves  a  variational-asymptotic  reduction  of  the 
3-dimensional  (3-D)  nonlinear  elasticity  problem  into  a 
2-D  linear  cross-section  analysis  and  a  1-D  geometrically 
exact  beam  analysis  -  based  on  Berdichevsky  [18]  and  pi¬ 
oneered  by  Hodges  and  his  co-workers  [19].  Aeroelastic 
computations  are  performed  on  the  beam,  followed  by  a 
recovery  of  the  3-D  stress  field.  The  method  is  efficient 
and  accurate  -  except  near  end-edges  and  discontinu¬ 
ities  for  which  a  3-D  analysis  is  still  needed  -  as  long 
as  the  cross-sections  are  small  compared  to  the  wave¬ 
length  of  deformations  along  the  beam.  Modern  hin¬ 
geless  and  bearingless  rotors  contain  3-D  flexible  load 
bearing  components  near  the  hub  that  have  short  as¬ 
pect  ratios  and  cannot  be  treated  as  beams.  Moreover, 
treatment  of  blades,  depending  on  their  advanced  geom¬ 
etry  and  material  anisotropy,  also  requires  continuous 
improvements  based  on  refinements  to  the  asymptotic 
cross-section  analysis  [20,  21]. 

A  second  objective  of  this  paper,  therefore,  is  to 


develop  a  3-D  Finite  Element  Model  (FEM)  for  rotary 
wing  structures  that  can  be  used  to  analyze  generic  3-D 
non-beam  like  hubs  as  well  as  advanced  geometry  blade 
shapes.  With  the  emergence  of  rotorcraft  Computational 
Fluid  Dynamics  (CFD),  physics-based  models  contain¬ 
ing  millions  of  grid  points  can  carry  out  Reynolds  Av¬ 
eraged  Navier-Stokes  (RANS)  computations  on  100s  of 
cores,  routinely,  in  a  research  environment  for  the  ro¬ 
tor,  and  even  for  the  entire  helicopter.  Applications  are 
focused  today  on  coupling  CFD  with  relatively  simple 
engineering-level  structural  models.  The  structural  ana- 
lyis  is  carried  out  on  a  single  processor  while  the  remain¬ 
ing  processors  lie  idle.  Therefore,  the  purpose  of  the  sec¬ 
ond  objective  is  to  explore  the  possibility  of  integrating 
3-D  FEM  as  the  physics-based  Computational  Structural 
Dynamics  (CSD)  model  in  the  structures  domain. 

There  is  no  question  that  such  a  capability  will  be 
powerful.  First,  it  will  provide  enabling  technology  for 
modeling  hingeless  and  bearingless  rotors  with  advanced 
geometries.  Second,  it  will  enable  the  direct  calculation 
of  stresses  on  critical  load  bearing  components  near  the 
hub,  a  capability  that  is  not  available  today.  Third,  it  will 
provide  a  true  representation  of  the  3-D  structure,  con¬ 
sistent  with  the  high  level  of  fidelity  sought  in  large  scale 
CFD.  And  finally,  even  though  this  research  is  targeted 
towards  large  scale  HPC  based  analysis,  as  a  by-product, 
it  will  always  provide  a  means  for  extracting  sectional 
properties  with  which  efficient  lower  order  design  analy¬ 
ses  can  still  be  carried  out.  The  key  question  for  such  a 
capability  is  that  of  an  efficient  solution  procedure. 

The  primary  objective  of  this  paper  is  to  answer  this 
key  question  directly.  As  in  CFD,  the  tremendous  ca¬ 
pabilities  of  HPC  is  also  envisioned  here  to  be  the  key 
technology  driver  and  enabler. 

Scope  of  Present  Work 

A  parallel  Newton-Krylov  solver  is  developed  in  this 
study  for  the  solution  of  3-D  FEM  analysis  of  rotors  in 
hover  and  forward  flight.  The  solver  is  based  on  the 
FETI-DP  method  of  iterative  substructuring.  The  pri¬ 
mary  emphasis  in  this  paper  is  on  the  scalability  of  this 
solver. 

The  formulation  of  the  3-D  FEM  analysis  pays  par¬ 
ticular  attention  on  the  structural  non-linearities  and  in¬ 
ertial  terms  that  are  unique  to  rotary  wings.  The  Krylov 
solver  is  equipped  with  a  Generalized  Minimum  Residual 
(GMRES)  update,  in  addition  to  its  traditional  Conju¬ 
gate  Gradient  (CG)  update,  to  accommodate  the  non- 
symmetric  nature  of  the  rotary  wing  inertial  terms. 

Advanced  finite  element  capabilities  like  locking-free 
elements,  hierarchical  elements,  nonlinear  constitutive 
models,  composite  ply  modeling  are  beyond  the  scope 
of  this  initial  work.  Grid  generation  is  not  part  of  this 
endeavor.  Simple  grids  are  constructed  that  are  adequate 
for  research  purposes.  Domain  partitioning,  on  the  other 
hand,  is  a  part  of  this  work.  Standard  graph  partition- 
ers,  which  are  readily  available,  will  not  suffice  for  reasons 
described  herein.  Most  key  elements  of  a  comprehensive 
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rotorcraft  analysis  are  not  considered  at  present:  airloads 
model,  trim  model,  extraction  of  periodic  dynamics,  and 
multibody  dynamics,  are  all  part  of  future  work. 

The  paper  is  organized  as  follows.  The  next,  i.e., 
the  second  section,  describes  the  formulation  of  the  3-D 
FEM  analysis,  followed  by  preliminary  verification  using 
thin  plate  and  rotating  beam  results.  The  third  section 
presents  a  brief  description  of  the  iterative  substructuring 
algorithm  and  its  parallel  implementation.  The  numeri¬ 
cal  scalability  of  the  algorithm  is  established  in  this  sec¬ 
tion.  The  fourth  section  introduces  the  key  components 
of  the  3-D  rotor  analysis:  geometry  and  grids,  partition¬ 
ing  and  corner  selection,  the  hover  prototype,  and  the 
transient  forward  flight  prototype.  The  fifth  section  is 
focused  on  scalability.  Calculations  performed  on  up  to 
48  processors  -  the  maximum  available  to  the  authors 
at  present  -  are  reported  here.  The  paper  ends  with  the 
key  conclusions  of  this  work  and  a  summary  of  the  future 
research  directions  that  are  critical  to  the  success  of  this 
endeavor. 

3-D  FINITE  ELEMENTS  FOR  ROTORS 

The  Finite  Element  formulation  is  based  on  well  es¬ 
tablished,  standard  procedures  [22,  23].  The  non-linear, 
geometrically  exact  implementation,  follows  an  incre¬ 
mental  approach,  using  Green-Lagrange  strain  and  sec¬ 
ond  Piola-Kirchhoff  stress  measures,  within  a  total  La- 
grangian  formulation.  The  main  contribution  here  is  the 
formulation  of  the  inertial  terms  that  are  unique  to  ro¬ 
tary  wings. 

Strain  Energy 

Let  the  deformed  coordinates  of  a  material  point  in 
the  blade  at  any  instant  be  given  by 

Xi  =  X°  +  Ui(Xi,  x°,  £3) 

*2  =  *2  +  u2(*i,*2,*3)  (1) 

*3  =  *3  +  u3(x\,x%,x%) 

where  x°  and  iq,  i  =  1,2,3  denote  the  undeformed 
coordinates  and  the  3-D  deformation  field  respectively. 
The  deformation  gradient  of  the  point  with  respect  to  its 
undeformed  configuration  is  then  V 


Xl,l 

*1,2 

*1,3 

dXi 

(2) 

x2,l 

*2,2 

*2,3 

where  xi:j  =  — „ 

X3,l 

*3,2 

*3,3  . 

j 

The  Green-Lagrange  strain  relates  the  deformed  length 
of  a  line  element,  dl ,  to  its  original  length  on  the  unde¬ 
formed  blade,  dl° ,  in  the  following  manner 

eijdxidxj  =  (1/2)  [(<iZ)2  —  (dl0)2]  (3) 

where  (dl)2  =  dxidxj  and  (dl0)2  =  dx°dx°.  The 
Green-Lagrange  strain  tensor  follows 

e=(l/2)(C-I3)  (4) 


Here  I3  is  a  3  x  3  unit  matrix  and  C  is  the  left  Cauchy- 
Green  deformation  tensor  given  by 

C  =  VT  V  (5) 


The  elements  of  the  Green-Lagrange  strain  tensor  have 
the  well-known  form 


1  /  dui  duj  duk  duk 

2  l  dx°  +  dx°  +  dx°  dx° 


k  =  1,2,3  (6) 


The  total  Lagrangian  formulation  is  based  on  virtual 
work  per  unit  original  volume.  The  stress  measure  that 
is  energetically  conjugate  to  Green-Lagrange  strain  is  the 
second  Piola-Kirchhoff  stress  tensor,  a.  That  is,  the 
strain  energy  of  the  deformed  structure  can  now  be  cal¬ 
culated  using  the  above  strain  and  stress  measures  with 
integration  over  original  volume.  The  variation  in  strain 
energy  is  then  simply 


5U  =  J  a  Seij  dV  (7) 

v 

For  non-linear  analysis,  an  incremental  procedure  is  fol¬ 
lowed.  The  strain  at  a  current  state  t  +  At  is  expressed 
in  terms  of  incremental  strains  measured  from  a  previous 
state  t 


£ij(t  +  At)  —  €ij(t)  +  A  €ij  (8) 

where  A e,y  is  the  incremental  strain.  It  must  be  under¬ 
stood  that  the  variation  in  strain  at  the  current  state  is 
simply  the  variation  in  incremental  strains.  That  is, 


6eij(t  +  At)  =  6  A  eij  (9) 

The  incremental  strain  is  related  to  the  incremental  de¬ 
formations  Aui. 

Aui  =  Xi(t  +  At)  -  Xi(t) 

=  Ui(t  +  At)  -  Ui(t) 

Substitution  of  the  strain  expression  Eq.  6  in  Eq.  8  and 
use  of  Ui(t  +  At)  =  Ui(t)  +  Am  gives 


1  /  dAui  dAuj 

2  y  dx°  +  dx° 

1  dA Uk  dAuk 
+  2  dx°  dx° 

1  J 


duk  dA  Uk  dA  Uk  duk  \ 
dx°  dx°  dx°  dx°  J 

A  £ij  +  A  Kij  k  =  1,2,3 

(11) 


where  the  linear  and  non-linear  strains  are  separately 
denoted  as  £ij  and  Kij.  The  variations  follow 

6Aeij  =  SAeij  +  SAKij  (12) 


where 
SAejj  = 

1  f  dSAui  dSAuj  duk  dSAuk 

2  y  dx°  dx°  dx°  dx° 

6  A  Kij  = 

1  / dAuk  dSAuk  dSAuk  dA Uk\ 

2  ~lhd~  +  lyf  J 


dSAuk  diik 
dx°  dx° 


(13) 
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Similarly  decompose  the  stress 

<7ij(t  +  At)  =  <Jij(t)  +  A  <Ti:j  (14) 

Use  Eqs.  12,  13  and  14  in  Eq.  7  to  obtain 


SU  =  J  A Oij  6 A Sij  dV  +  J  SAeij  dV+ 

v  v 


(15) 


The  integration,  as  before,  is  over  the  original  volume. 
This  expression  is  exact  for  large  deformations,  large 
strains,  and  material  non-linearities.  For  linear  elastic 
materials,  the  incremental  stress  is  related  to  the  incre¬ 
mental  linear  strain  by  a  constitutive  relation  of  the  form 


A(Jij  —  D  'tj  jii  n  A  sm  n 

Dijmn  has  the  general  form  LT  D'  L  with  D'  contain¬ 
ing  the  material  constitution  and  L  the  composite  ply 
angle  transformation.  The  first  term  in  Eq.  15  then  be¬ 
comes  the  standard  linear  finite  element  term.  The  sec¬ 
ond  term  is  an  incremental  term  involving  the  existing 
state  of  stress  and  linear  strains.  The  third  term,  under¬ 
lined,  is  the  key  term  for  rotorcraft.  It  is  an  incremental 
term  involving  the  existing  state  of  stress  and  the  non¬ 
linear  strains.  This  term  produces  the  structural  cou¬ 
plings  between  axial  and  transverse  deformations  (torsion 
is  not  a  separate  state  of  motion  in  3-D  but  a  function 
of  transverse  deformations)  in  response  to  rotational  ef¬ 
fects.  It  carries  within  it  the  classical  extension-bending 
and  flap-lag  structural  couplings.  The  fourth  term  is 
dropped  as  part  of  linearization,  with  the  assumption 
Dijmn^mnSnij  «  0.  The  final  expression  then  becomes 


SU  —  I  Dijmn  ^£mn  SA£ij  (LV ~\~ 


8AC„  dV  +  J „v(t)  6A«V  dV 


(16) 


Iterations  are  required,  of  course,  primarily  for  <Jij(t)  but 
also  for  the  linearization.  The  latter  is  usually  insignif¬ 
icant.  For  example,  for  a  static  solution,  given  a  pre¬ 
scribed,  deformation-independent,  non-inertial  external 
forcing,  the  equation  of  motion  takes  the  form 


be  updated  on  both  sides  initially  to  obtain  the  correct 
non-linear  stiffness.  Thereafter,  modified  Newton  itera¬ 
tion  is  enough  for  the  purposes  of  airload  non-linearities. 

Kinetic  Energy 

The  variation  in  kinetic  energy  or  the  virtual  work 
by  inertial  forces  is  given  by 

ST  =  -SWr  =  J  pp-  Sr  dV 
v 

where  r  and  Sr  are  the  acceleration  and  virtual  displace¬ 
ment  of  a  material  point  P  on  the  blade  relative  to  an 
inertial  frame  I.  Let  P  lie  in  a  non-inertial  frame  B. 
The  frames  /  and  B  are  associated  with  corresponding 
coordinate  axes  or  basis.  At  any  instant,  frame  B  has  a 
displacement  xBI  relative  to  the  frame  /  and  an  orienta¬ 
tion  defined  by  a  rotation  matrix  CIB .  CIB  defines  the 
orientation  of  I  relative  to  B ,  i.e.  it  rotates  the  axis  from 
B  to  I.  If  the  components  of  xBI  expressed  in  B  and  I 
basis  are  denoted  by  xBI^B  and  x31!1 ,  then 

x31'1  =  C1BxB1'B 

Recall,  that  the  time  derivative  of  the  rotation  matrix 
is  related  to  the  skew  symmetric  angular  velocities  by 

CIB  =  CibAbi'b  =  —CoIBlICIB 

where  £jbi/b  is  the  angular  velocity  of  B  relative  to  I 
and  measured  in  B ,  and  u >IB/J  is  the  angular  velocity 
of  I  relative  to  B  and  measured  in  I.  The  components 
of  the  motion  of  the  point  P  relative  to  I  and  B  and 
expressed  in  I  and  B  frames  satisfy 

rp//J=  CIB[xBI'B  +  rPB'B) 

■PI /I  =  cIB^BI/B  +  +PB/B  +  ~BIIBrPBIB ) 

-P///  =  CIB^BI/B  +  -BIIBVBIIB  +  fPB/B+  (lg) 

2-BI/s-pB/B  +  ~BI/B~BJ/BrpB/B+ 

~jBI  I B  rpB  I  B'j 

where  the  frame  motions  have  been  expressed  in  body- 
fitted  coordinates  as  x31!1  =  v31^1  =  CIBvBI/B  and 
x31/1  =  cIBvBI/B  +  CIBuIB/BvBI/B.  The  components 
of  virtual  displacement  are 


SU  =  SWE 


where  SWe  is  the  external  virtual  work.  The  iterative 
procedure  is  then 


JDijmn  A£mn  SAsij  dv+!ai ^  6AKi> dv 

V  V 

=  SWe  ~  J  SAe^  dV 


(17) 


readily  recognized  as  a  Newton-Raphson  iteration.  If 
a ij(t)  is  updated  only  on  the  right  hand  side,  the  proce¬ 
dure  is  a  modified  Newton  iteration.  For  a  rotor,  it  must 


Sr31'1  =  CIB(SxBI/B  +  S^BI'BrPB'B  +  5rPB'B) 

=  CIB(SxBI/B  -  fPB!B Si\)BI!B  +  SrPB'B) 

=  CIBRTSq 

(19) 

where 

/  I  \  (  8xBI/B  \ 

R  =  I  —rPB!B  ]  Sq  =  I  Sifi31^3  ) 

V  Rl  )  \  Sqb  J 

XBI/B  anci  ipBI/B  are  the  rigid  body  translational  and 
rotational  states  of  frame  B.  The  virtual  displacement 
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SrpB/S 

in  B  frame  is  written  in  terms  of  the  variations 
in  finite  element  degrees  of  freedom  SrPB/B  =  RpSqb- 
The  kinetic  energy  is  then 


ST  =  Jp  (fr'-'/'g  f"/'  dV  m 

V 

In  general,  frame  B  may  contain  a  flexible  component. 
For  a  simple  illustration,  consider  B  to  be  the  unde¬ 
formed  blade  rotating  frame,  containing  the  entire  blade, 
with  origin  at  the  rotor  hub.  It  undergoes  control  mo¬ 
tions  9 , 6 ,  9  with  respect  to  another  rotating  frame  with 
origin  at  the  hub.  This  frame  undergoes  rotational  mo¬ 
tions  fl,fl,fl  with  respect  to  a  non-rotating  inertial  frame 
I  at  the  hub.  B  has  no  rigid  body  states  with  respect  to  I. 
Thus  SxBI!B  =  0,  5ij> BI/B  =  0,  and  vBI/B  =  vBI/B  =  0. 
It  follows 


CIB 


C-p  —SipCg  S^Sg 

S'tp  CijjCO  C-ipS  0 

0  Sg  eg 


(21) 


where  ip  =  fit  is  the  blade  azimuth  angle,  9  is  the 
instantaneous  control  angle,  and  eg  =  cos  9,  sg  =  sin  9 
etc. 

o  \T 

flsg  eB 
flcg  ) 

(22) 


eB  and  e1  are  the  unit  basis  vectors  in  the  B  and  I  frame 
respectively.  The  non-zero  components  of  rPI ^  in  the 
kinetic  energy  expression  Eq.  20  then  become 


y-PB/B 


Ul 

U2 

U3 


(23) 


2ujBI  /  B  vPB/ B  =  2 


0 

flee 
— fise 


—flee  f Ise 
0  -9 

9  0 


Thus,  the  kinetic  energy  Eq.  20  takes  the  following  form 


ST  = 


p  (Tm  u  +  Tcii  +  Tku  +  Tf)dV 


(28) 


With  the  simplest  assumption  of  only  a  collective,  and 
steady  rotation,  we  have  from  Eq.  23-Eq.  26  the  following 
mass,  damping,  stiffness,  and  force  contributions  from 
inertia. 


Tm  —  I3 
Tc  =  -20 

Tk  =  -O2 


0  eg  - Sg 
-eg  0  0 

Sg  0  0 

10  0 
0  c2g  SgCg 

0  - SgCg  Sg 


Tf  =  Tk[x\x%xl}T 


(29) 


Virtual  Work  by  External  Forces 

A  surface  lies,  always,  on  one  or  more  of  the  element 
faces,  defined  by  its  natural  coordinates  £  ±  1,  77  ±  1,  or 
(  ±  1  (see  following  section).  A  differential  change 
in  the  natural  coordinates  (£,  77,  Q  creates  the  following 
changes  in  the  geometric  coordinates  (xi,  X2,  X3). 

N 

dxj  =  Xj^d^  =  Hkt£X*  i  =  1,2,3  (30) 

fc= 1 

Denote  the  vector  [xi^,  X2,f,  £3,5]  as  x Similarly,  dif¬ 
ferential  changes  drj  and  d(  create  the  vectors  Xi^di 7  and 
Xixd(,  i  =  1,2, 3,  in  the  geometric  coordinates.  The  area 
d^dp,  for  example,  then  corresponds  to  the  area  of  a  par¬ 
allelogram  between  the  two  vectors  x^d£,  and  x^dr 7  in 
the  geometric  domain,  defined  by  their  cross  product  or 
cross  product  matrices:  x ^  x  x  rl  =  —xt^xtV  =  xt^xtV 


x ^  x  x)7J  = 


,rj  *^2, 77*^3, £ 

*£3, £*^1,77  *^3,77^1,^ 

3'l,£%2,r)  ^1,77^2,^ 


-B//B  -BI/B  rPB/B  = 


-II2 

9flsg 

9flcg 

(  Xi 

9flsg 

- n2c2g  -  92 

fl2CgSg 

X2 

9flcg 

fl2CgSg 

- n2s2e  -  92  _ 

\  X3 

~BI/BrPB/B  = 


0 

— flcg  -f-  fl9sg 

flsg  T  fl9cg 

(  Xl  \ 

flcg 

-  fl9sg 

0 

9 

x2 

—  flsg 

—  fl9cg 

9 

0 

\  *3  / 

(26) 


The  virtual  displacement  is  simply  the  variation  of 
the  incremental  displacements 

SrPB/B  =  Su ;  5rPI/I  =  CIBSu  (27) 


For  example,  if  normal  pressure  p  on  an  elemental  face 
defined  by  £  =constant  produces  a  virtual  work  p  dA  ■  Su, 
the  components  of  dA  are  simply  x^xtV  dt;  dp.  The  com¬ 
ponents  of  Su  are  the  virtual  deformations  5[u-i  U2  M3]71  = 
SqT Ht  ,  where  H  are  structural  shape  functions  (see  next 
section).  The  virtual  work  expression  is  then 

SWe  =  SqT  j  pHT  x^xtV  d£dp  =  SqT  Q  (31) 

A 

A  general  surface  stress  distribution  <75  is  incorporated 
in  exactly  the  same  manner  using  (as  •  dA)  ■  Su  as  virtual 
work. 

SWe  =  SqT  J  Ht  as  x^xiV  d^dp  =  SqTQ  (32) 

A 
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Because  fluid  stress  are  deformation  dependent,  the  area 
must  be  evaluated  at  the  deformed  configuration.  The 
deformed  configuration  is  not  known  a  priori,  therefore, 
iterations  are  required.  This  is  discussed  further  in  the 
subsection  ‘Steady  Hover  Prototype’  under  ‘3-D  FEM 
Rotor  Analysis’. 


both  interpolated  using  the  same  shape  functions.  Thus 


N  N 


Y' H  x°  ■ 

/  v  11a  ^lai 
d — 2. 

<1 

WI 

II 

3 

<3 

N 

^  ^  Ha  Xi  a; 
a= 1 

N 

U'i  ^  ^  Ha  ILi  ( 

a=  1 

Brick  Finite  Elements 

The  analysis  of  bending  dominated  problems  involv¬ 
ing  thin  structures  using  3-D  elements  suffer  from  severe 
stiffening  known  as  element  locking  as  the  element  thick¬ 
ness  tends  to  zero.  A  simple  but  effective  way  to  pre¬ 
vent  locking  is  to  use  higher-order  elements  -  as  in  this 
study  -  containing  sufficient  number  of  internal  nodes. 
Devising  more  efficient  lower-order  locking-free  brick  el¬ 
ements,  based  on  reduced-integration  or  Enhanced  As¬ 
sumed  Strain  methods  are  beyond  the  scope  of  this  initial 
development.  The  main  concern  at  present  is  accuracy. 


where  a  is  the  elemental  node  point  index.  N  =  27.  The 
shape  functions  are  expressed  in  element  natural  axes  £, 
77,  and  C,  where  —  1  <  £,77,  C  <  1.  We  consider  Lagrange 
polynomials  in  each  direction. 

Ha  =  Ha(Z,r),Q  =  £?(£)  LJ(V)  14(C)  (34) 

with  n  =  m  =  p  =  2;  and  each  of  /,  J,  K  vary  as  1,  2,  3 
respectively.  The  second-order  polynomials  in,  say  77,  are 
77(77  —  l)/2,  1  —  ?72,  and  77(77  +  l)/2.  Based  on  the  local 
node  ordering  shown  in  Fig.  1(b),  we  have  for  example 
the  shape  function  corresponding  to  node  11 


A 

(a)  Element  in  physical 
coordinates  (only  edge  nodes  shown) 
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coordinates  (all  nodes  shown) 

Figure  1:  27-node  isoparametric,  hexahedral 
brick  element  in  curvilinear  natural  coordi¬ 
nates;  4x4x4  Gauss  integration  points. 

Figure  1  shows  an  isoparametric,  hexahedral, 
quadratic  brick  element  developed  in  this  study.  It  con¬ 
sists  of  8  vertex  nodes  and  19  internal  nodes  -  12  edge 
nodes,  6  face  nodes,  and  1  volume  node.  Within  isopara¬ 
metric  elements,  geometry  and  displacement  solution  are 


Hi!  =  m)  L2(,?)  Lf(C)  =  \r, C(1  -  C2)  (V  +  1)  (C  -  1) 

The  strains  require  the  derivatives  of  the  shape  functions 
with  respect  to  geometric  coordinates 

dm  ( dHa  \  dAui  _  ^  ( dHa  \  A 

dx°r h{dx°J Uia;  Uia 

(35) 

These  are  calculated  from  the  derivatives  with  respect  to 
natural  coordinates  as  follows 


/  dHa  \ 

[  a! 

1 

r 

\  ac  ) 

dx J 

dx° 

dx°3 

~w 

dx J 

di 

dx'l 

ac 

a^° 

dr/ 
dx J 

dij 

dxl 

drj 

dx°3 

ac 

ac 

/  (36) 


J 


The  evaluation  of  the  Jacobian,  J,  is  straight  forward 
using  the  derivatives  of  the  shape  functions  with  respect 
to  element  natural  axes  and  the  location  of  the  nodal 
points  (i.e.  the  grid  points),  i.e.,  the  first  of  Eq.  33 


'  HU 

H2ti  . 

■  Hnx 

J  = 

H^r, 

H‘2.r}  ■ 

•  Hjy^ 

H2X  ■ 

■  Hn,  c 

X11 

T° 

X21 

To 

X31 

x12 

T° 

x22 

To 

x32 

IN  dj2N  X3N 

(37) 


The  required  derivatives  are  then 


dHa 

~dxT 

dHa 

M 


(38) 
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'  Hl.l 

0 

0 

H2, 1  . 

0 

0 

Hi, 2 

0 

0 

0 

0 

0 

Hi, 3 

0 

•  Hjv,  3 

ffl,2 

Hip 

0 

H2,2 

0 

0 

Hi, 3 

Hi,  2 

0 

•  Hjv,  2 

tq 

co 

0 

Hi,i 

H2, 3  . 

•  Hat,  1 

^2lHi,i 

^3lHi,i 

^llH2,i 

tq 

CO 

^12Hi,2 

^22Hi,2 

^32Hi,2 

^12  H2,2 

I32  Hjv,  2 

^13Hi,3 

^23Hi,3 

^33Hi,3 

^13H2j3 

133Hn,3 

hiH\t2  +  ?i2Hi,i 

^2lHi,2  +  £22Hi,i 

^3lHi,2  +  I32H11 

^llH2,2  +  !i2H2,i 

■  ■  ■  h\HN,2  +  teH/v,! 

^12Hi,3  +  I13H12 

^22Hi,3  +  I23H1.2 

^32  Hi, 3  +  I33H12 

^12H2,3  +  ^13H2,2 

•  •  ■  ^32Hjv,3  +  l33HN2 

^llHi,3  +  Zi3Hi,i 

^2iHi,3  +  I23H  1,1 

^3iHi,3  +  I33H11 

^iiH2,3  +  Zi3H2,i 

■  ■  ■  ^3lHjv,3  +  /33Hjv,i 

Henceforth  the  above  derivatives  are  denoted  as 
Ha,i ,  Ha,2,  and  Ha  ,3.  In  addition,  the  first  quantity  in 
Eq.  35  is  denoted  by  lij,  i.e. , 

N 

l‘ij  ^  '  B a.j  ^i.  a 
a—  1 

From  the  linear  strain  A £jj  as  defined  in  Eq.  11,  the 
strain-displacement  relation  now  takes  the  following  form 

Ae  =  {Blq  +  Bli)Aq  (39) 

where 

AeT  =  A[en  £22  £33  2ei2  2e23  2ei3] 

A qT  =  [till  U2i  U3i  Ui2  U22  U32  ■  ■  ■  Uin  U2N  t*3Af] 

(40) 

and  the  expressions  for  Bl 0  and  Bl  1  are  given  above. 
The  first  term  in  the  strain  energy  Eq.  16  then  becomes 

SAqT  ^  J  ( Blo  +  BL1)T  D  {Bl 0  +  BL1)  dV^j  A q 

(41) 

The  second  term  is  treated  is  the  same  manner  to  obtain 
5Aqr  (Blo  +  BLl)T  a(t)  dV^j  Aq  (42) 

where  a(t)  =  [<Jn  <r22  033  a  12  023  (Ti3]t.  Consider  the 
non-linear  incremental  strain  Ak,3  as  defined  in  Eq.  11. 
It  is  directly  in  a  quadratic  form,  and  hence  the  third 
term  can  be  re-arranged  as 

6AqT  ^  J  Btnl  a(t)  Bnl  dV^j  Aq  (43) 

with  the  matrices  having  special  forms 


1 

q 

c-+- 

0 

0 

'  0 

0 

0  ' 

cr(t)  = 

0 

0 

;  0  = 

0 

0 

0 

0 

0 

0 

0 

0 

1"  BNl 

0 

0 

'  0  ' 

Bnl  = 

= 

0 

Bnl  0 

;  0  = 

0 

0 

0 

Bnl  j 

0 

0 

0 

H2, 1 

0 

0  ... 

Ha-,i 

Bnl  = 

Hi, 2 

0 

0 

H2, 2 

0 

0  ... 

Hn,  2 

Hi, 3 

0 

0 

H2, 3 

0 

0  ... 

Hat,  3 

The  volume  integrations  are  performed  using  4  Gauss 
points  along  each  natural  coordinate  axes,  a  total  of  64 
integration  points.  Note  that 

dV  =  det( J)  d£  dr]  dC, 

VERIFICATION  OF  3-D  FEM 

A  preliminary  verification  of  the  3-D  FEM  analysis  is 
carried  out  by  reproducing  non-rotating  plate  and  rotat¬ 
ing  beam  frequencies.  The  former  verifies  the  locking-free 
behavior.  The  later  verifies  the  non-linear  implementa¬ 
tion. 

Here,  and  henceforth,  a  3-D  grid  is  denoted  as 
ni  x  n2  x  n3  or  (m,  n2,  n3).  They  denote  the  number  of 
elements  along  span,  chord,  and  thickness,  respectively. 

Thin  Plate  Frequencies 

The  locking-free  behavior  of  the  brick  elements,  in 
shear,  is  verified  by  re-producing  classical  Kirchhoff  thin 
plate  frequencies  for  a  square  cantilevered  plate. 

The  plate  is  modeled  using  a  4  x  4  x  4  brick  grid 
(Fig.  2),  i.e.,  there  are  4  layers  of  bricks  across  thickness. 
Starting  from  a  solid  cube,  the  variation  in  natural  fre¬ 
quencies  with  a  gradual  reduction  in  thickness  is  shown 
in  Fig.  3.  It  is  clear  that  the  frequencies  approach  those 
of  a  thin  plate.  The  plate  frequencies  are  obtained  from 
converged  rectangular  plate  finite  elements  and  are  vali¬ 
dated  easily  with  classical  solutions  [24] .  The  discrepancy 
at  the  higher  modes  are  mostly  resolved  with  a  finer  mesh 
converged  solution  (Table  1).  The  remaining  differences 
are  due  to  shear,  not  present  in  the  Kirchhoff  solution, 
but  present  in  the  brick  solution. 


Figure  2:  A  cantilevered  square  plate 
with  4x4x4  brick  element  grid;  thick¬ 
ness  t  =  1  in;  dimensions  a  =  39.4  in 


Mode 

number 

4x4x4 

bricks 

8x8x4 

bricks 

Kirchhoff 

plate 

1 

3.55 

3.50 

3.47 

2 

8.68 

8.53 

8.51 

3 

22.93 

21.58 

21.29 

4 

28.16 

27.29 

27.19 

5 

32.84 

31.19 

30.96 

6 

56.78 

54.31 

54.13 

7 

71.19 

63.09 

61.29 

8 

74.75 

64.96 

64.16 

9 

83.68 

72.50 

70.98 

Table  1:  Plate  frequencies  using  3-D  FEM:  nondi- 
mensionalized  w.r.t.  \J D/ pta4,  a:  square  plate  di¬ 
mension,  t :  thickness,  p :  density,  D  =  Et3  / 12(1  — 
v2),  E:  Young’s  Modulus  and  v.  Poisson’s  ratio 

Slender  Beam  Frequencies 

Next,  the  brick  element  is  verified  using  a  slender 
geometry  that  behaves  as  a  beam  over  a  large  variation 
of  thickness  and  aspect  ratios.  The  bending  frequencies 
are  easy  to  re-produce,  we  focus  on  torsion.  Consider 
a  uniform  beam  of  aspect  ratio  100  and  a  square  cross- 
section,  i.e.,  dimensions  of  100c  x  c  x  c,  in  length,  width, 
and  thickness.  The  torsion  frequencies  converge  towards 
Euler-Bernoulli  (EB)  values  with  4  to  9  cross-section  ele¬ 
ments  (Table  2)  for  this  simple  geometry.  The  remaining 
difference  stems  from  span- wise  resolution. 

The  effect  of  span-wise  resolution  is  shown  in  the 
first  row  of  Table  3,  starting  from  8x3x3  grid  as 
the  baseline.  The  higher  (second)  torsion  frequency  is 
shown.  Span-wise  resolution  becomes  more  important  as 
the  beam  thickness  is  reduced.  This  is  shown  in  the  sub¬ 
sequent  rows  and  columns  of  Table  3.  The  rows  show 
the  variation  of  torsion  frequency  with  a  progressively 
thinner  beam.  The  columns  show  the  effect  of  span-wise 


Figure  3:  Natural  frequencies  of  a  solid  cube  ap¬ 
proaching  Kirchhoff  thin  plate  frequencies  (sym¬ 
bols)  with  reduced  thickness 


section  grid 

Torsion  1 

Torsion  2 

lxl 

1.710 

5.132 

2x2 

1.586 

4.758 

3x3 

1.577 

4.733 

4x4 

1.576 

4.728 

5x5 

1.575 

4.726 

Table  2:  Non-rotating  beam  torsion  frequencies 
vs.  cross-section  grid  refinement;  8  span-wise 
elements;  nondimensionalized  w.r.t.  a/GJ/JL2; 
EB  values  are  1.571  and  4.712;  beam  dimension: 

100c  x  c  x  c 


Thickness 

t 

ni  =  8 

7li  =  16 

II 

to 

o 

c 

4.733 

4.726 

4.725 

c/2 

4.757 

4.746 

4.742 

c/4 

4.824 

4.794 

4.784 

c/8 

4.886 

4.839 

4.824 

Table  3:  Non-rotating  second  torsion  frequency  vs. 
span-wise  grid  refinement;  grids  are  m  x  3  x  3; 
nondimensionalized  w.r.t.  \J GJ/IL 2;  EB  value  is 
4.712;  beam  dimension:  100c  x  c  x  t 

refinement  for  each  thickness.  There  is  increased  devia¬ 
tion  from  Euler-Bernoulli  values  as  the  thickness  reduces. 
With  thickness  fixed  at  c/4  and  grid  at  16  x  3  x  3,  the 
aspect  ratio  of  the  beam  is  now  progressively  reduced. 
Table  4  shows  that  from  100  to  20  the  frequencies  re¬ 
main  nominally  constant.  At  aspect  ratio  5  there  is  still 
only  an  error  of  5  —  6%.  This  deviation  stems  clearly 
from  the  short  aspect  ratio  of  the  problem  -  even  though 
its  exact  nature  and  source  is  not  studied  here  in  detail. 

The  configuration  with  aspect  ratio  20,  i.e.  dimen¬ 
sions  20c  x  c  x  c/4,  is  now  used  to  compare  the  rotating 
frequencies.  The  material  modulii  are  E  =  8.2700  x  107 
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(a)  Mode  3:  Flap  2 


(b)  Mode  4:  Lag  1 


(c)  Mode  5:  Flap  3  (d)  Mode  6:  Torsion  1 


(e)  Fan  plot  showing  rotating  frequencies 


Figure  4:  Rotating  frequencies  and  mode  shapes  of  an  uniform  soft  in-plane  hingeless  rotor  of  aspect 
ratio  20,  thickness  25%  chord;  3-D  bricks  vs  1-D  beam;  16  x  3  x  3  grid 


Aspect 

Ratio 

Torsion  1 

Torsion  2 

100 

1.598 

4.794 

40 

1.599 

4.799 

20 

1.603 

4.813 

15 

1.606 

4.826 

10 

1.615 

4.860 

8 

1.622 

4.890 

6 

1.635 

4.919 

5 

1.645 

4.991 

4 

1.662 

5.064 

3 

1.794 

5.478 

Table  4:  Non-rotating  torsion  frequencies  vs.  as¬ 
pect  ratio;  16  x  3  x  3  grid;  nondimensionalized 
w.r.t.  y/GJ/IL 2;  EB  values  remain  1.571  and  4.712 
for  all  aspect  ratios. 

Pa  and  G  =  3.4458  x  10'  Pa  (Poisson’s  ratio  v  =  0.2), 
density  is  p  =  192.2208  kg/m3,  and  the  dimension 
c  =  0.0864  m.  The  rotation  axis  is  at  mid-chord. 

A  few  of  the  key  rotating  modes  are  shown  in 
Figs.  4(a)-  4(d).  The  frequency  plot,  Fig.  4(e),  shows 
the  the  beam  frequencies  are  almost  exactly  reproduced 
by  3-D  FEM  -  and  the  small  grid  size  of  16  x  3  x  3  is 
adequate  for  this  simple  problem.  Note  that  this  serves 
as  a  verification  of  the  non-linear  formulation.  The  tor¬ 
sion  frequency  shows  an  error  of  5%  consistent  with  the 
deviation  in  the  non-rotating  case  at  this  level  of  grid 
refinement.  The  torsion  frequency  is  relatively  high  for 
this  structure,  and  occurs  only  as  the  sixth  mode.  The 
rotating  frequencies  for  SI  =  27  rad/s  are  tabulated  in 
Table  5. 


Mode 

Beam  freqs. 

3-D  freqs. 

Mode 

no. 

(/rev) 

(/rev) 

type 

i 

0.826 

0.824 

Lag  1 

2 

1.059 

1.058 

Flap  1 

3 

2.768 

2.769 

Flap  2 

4 

5.058 

5.006 

Lag  2 

5 

5.211 

5.223 

Flap  3 

6 

6.431 

6.625 

Torsion  1 

7 

8.542 

8.597 

Flap  4 

Table  5:  Rotating  frequencies  for  a  soft  in-plane 
hingeless  rotor;  16  x  3  x  3  grid  in  3-D 

ITERATIVE  SUBSTRUCTURING  USING 
PARALLEL  KRYLOV  SOLVER 

The  parallel  Krylov  solver  developed  in  this  study 
to  provide  an  efficient  and  scalable  3-D  FEM  solution  is 
described  in  this  section. 

The  effectiveness  of  the  method  of  substructures  in 
solving  large  scale  problems  is  mentioned  already  in  the 
introduction.  Each  substructure  (or  subdomain)  can  use 
a  solver  tailored  to  its  local  condition  number.  Most 
often,  for  reasons  cited  in  the  introduction,  the  substruc¬ 
tures  are  required  to  be  solved  using  direct  factorization. 
This  is  the  approach  taken  in  the  present  paper.  Iterative 
substructuring  then  provides  a  preconditioned  iterative 
solver  for  the  substructure  interface  problem. 

The  interface  problem  is  more  amenable  to  an  iter¬ 
ative  solver,  unlike  the  substructures  themselves,  as  its 
condition  number  (ratio  of  maximum  to  minimum  eigen¬ 
values)  grows  with  substructure  mesh  resolution  at  a  rate 
that  is  one  order  lower  compared  to  the  original  problem 
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-  i.e.  by  0(ft,_1)  for  2nd  order  and  by  0(/i^3)  for  4th 
order  PDEs  where  h  is  a  typical  mesh  resolution  within 
each  substructure.  However,  it  also  grows  necessarily  at 
0(H~1)  where  H  is  an  average  subdomain  size.  Indeed, 
for  a  2nd  order,  elliptic,  positive  definite,  and  coercive 
operator,  and  an  uniform  finite  element  meshing,  the  pre¬ 
cise  condition  number  n  of  the  interface  S  is  proven  to 
be  [25] 


k(S)  <  c 


H 

hHl 


where  H  is  the  maximum  and  H„,  is  the  minimum 
subdomain  size,  c  is  a  constant  independant  of  h,  H , 
and  Hm.  The  objective  of  iterative  substructuring  is  to 
provide  parallel  preconditioning  methods  such  that  the 
preconditioned  interface  problem  has  a  condition  number 
independent  of  both  h  and  H.  Such  a  preconditioner  is 
called  an  optimal  preconditioner. 

The  dependence  on  0 (I7“2)  cannot  be  removed  with¬ 
out  a  higher  level  coarse  problem  -  a  mechanism  to  prop¬ 
agate  local  substructure  information  globally.  Communi¬ 
cation  only  between  neighboring  subdomains  will  always 
show  this  dependence.  In  FETI-DP,  the  coarse  problem 
is  constructed  out  of  a  selected  set  of  substructure  corner 
nodes. 

The  building  blocks  of  a  preconditioned  Krylov 
solver  (to  solve  M~xAx  =  M~lb ,  where  M  is  the  pre¬ 
conditioner)  are:  (1)  residual  calculation  r  =  b—  Ax,  (2) 
preconditioning  M~1r  and,  (3)  a  general  matrix-vector 
multiplication  procedure  Av.  An  iterative  substructur¬ 
ing  algorithm  computes  these  building  blocks  in  a  paral¬ 
lel  manner.  Once  the  building  blocks  are  provided,  con¬ 
structing  an  iterative  update  (Krylov  update)  using  these 
blocks  is  straight  forward.  The  blocks  and  the  update 
then  form  the  Krylov  solver.  Unlike  a  simple  Conjugate 
Gradient  (CG)  update,  a  Generalized  Minimum  Residual 
(GMRES)  update,  however,  poses  a  few  parallelization 
issues  of  its  own. 

The  iterative  substructuring  algorithm,  its  numeri¬ 
cal  scalability,  and  the  parallelization  issues  of  the  CG 
and  GMRES  updates  are  documented  below. 


The  FETI-DP  Algorithm 

For  a  3-D  substructure,  each  interface  node  can  be  a 
face,  edge,  or  vertex  node.  Of  these,  the  edge  and  vertex 
nodes  —  that  are  common  to  more  than  two  substruc¬ 
tures  —  are  designated  as  corner  nodes.  The  degrees  of 
freedom  (DOFs)  associated  with  the  corner  nodes  are  for¬ 
mulated  as  a  primal  interface.  The  rest  are  formulated 
as  a  dual  interface.  The  corner  nodes  form  the  coarse 
problem,  and,  are  key  to  ensuring  optimal  scalability. 

The  number  of  DOFs  associated  with  a  corner  de¬ 
pend  on  the  order  of  the  problem,  e.g.,  3  for  2nd  order 
brick  FEM  or  6  for  4th  order  plate  or  shell  elements. 
Thus,  it  renders  the  coarse  mesh  automatically  denser 
with  increase  in  order.  The  FETI-DP  method  and  its  im¬ 
plementation  follows  entirely  the  work  of  Refs.  [15,  16]. 
A  detailed  exposition  of  our  implementation  is  not  pro¬ 


vided,  only  a  brief  description  of  the  key  components  are 
summarized  below. 

If  the  nodes  of  a  subdomain  are  re-ordered  with  in¬ 
ternals  first  (Is),  followed  by  the  interface  edges  and  faces 
(r|.),  and  then  the  corners  (T^),  where  the  subscript  s 
denotes  subdomain  quantities,  then  a  subdomain  matrix, 
say  the  stiffness  matrix,  takes  the  following  form 


Ks 

1XII 

K)e 

kic  ' 

Ks  = 

Ksei 

F%e 

Fsec 

[  Kci 

Fqe 

Kscc  J 

Ks  Ks 
1XRR  1VRC 

Kcr  KqC  _ 

(44) 


where  the  internal  and  edge  nodes  are  denoted  together 
as  Rs  nodes.  The  subdomain  forcing,  fs,  and  unknowns, 
us,  are  correspondingly 


Two  Boolean  restrictions  are  defined  for  each  subdomain. 
The  first  Boolean  restriction,  BR,  restricts  usR  to  usE,  and 
assigns  a  +1  or  —1  sign  such  that  equality  of  the  interface 
degrees  of  freedom  are  guaranteed  upon  convergence. 

E  BrU'r  =  0 

S 

The  summation  sign  denotes  assembly  over  sub  domains. 
The  second  Boolean  restriction,  B q,  restricts  the  global 
corner  nodes  to  subdomain  corners.  Note  that,  for  a  re¬ 
ordered  subdomain,  the  first  restriction,  BR  has  the  form 
and  size 


Bsr  = 


0 


T 

-'s 

E 

I 


(47) 


where  BE  is  a  diagonal  matrix  with  entries  +1  or  —  1. 
The  dual-primal  procedure  computes  a  set  of  dual  vari¬ 
ables  associated  with  the  interface  which  on  convergence 
allows  the  recovery  of  the  subdomain  internal  and  edge 
DOFs  usR  as 

Krrur  =  Ir~  Bsrt  Xs  -  KsrcBscu9c  (48) 

and  all  the  global  corner  DOFs  u9c  as 

K*ccu9c  =  FCR\  +  f*c  (49) 


where  A  and  Xs  are  the  dual  variables  and  their  subdo¬ 
main  restrictions.  The  corner  problem,  i.e.  the  coarse 
grid  problem,  is  also  constructed  subdomain  by  subdo¬ 
main.  Formally, 


K*cc  =  J2  BSCT  [kscc  -  KZ;rK%r-'KI 

S 

FcrX  =  J2  B^KbnK^B^ Xs 

S 

fc  -  E  BcT  \fc  KchK^Fr  1 


B, 


c 


(50) 
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The  solve,  however,  is  carried  out  in  every  subdomain. 
Thus,  before  the  interface  iterations  begin,  the  subdo¬ 
main  contributions  to  KGC  are  constructed  and  factor¬ 
ized  in  every  subdomain,  and  communicated  globally. 
Thereafter,  during  the  Krylov  iterations,  the  coarse  prob¬ 
lem  is  only  a  repeated  right  hand  side  solve. 

The  building  blocks  of  the  Krylov  solver  are  briefly 
stated  below. 


Residual  calculation 

The  residual  is  calculated  in  two  parts  from  Eq.  48. 

r  =  Y  BRuR  =  ri  +  r2  (5i) 

S 

The  first  part  rq  is  calculated  as 

n  =  £  BsRK°RR-lrR  -  Y,  BsRKsRR-xB°RTy  (52) 

s  s 

The  second  part  V2  is  calculated  after  the  coarse  solve 
r2  =  -£  BsRKsRR-lK°RCBsGuac  (53) 

S 

Using  Eq.  49,  the  total  residual  then  takes  the  form 
d  —  F A.  Note  that  the  residual  calculation  requires 
KRR~  .  Therefore,  during  subdomain  partitioning,  the 
corner  node  selection  must  ensure  null  kernels. 


Preconditioner 


The  subdomain  residuals,  rs,  are  used  to  construct 
subdomain  fluxes 


rjs=KsEIws+KsEEBsErs  (54) 

with  ws  obtained  using  subdomain  Dirichlet  solves 

=  ~KjI~1KfE  BaErs  (55) 

from  which  the  preconditioned  residual  is  generated 

lr  =  YBW  (56) 


Af-1' 


Expanding  the  above  expressions,  we  have,  formally 


m-1  =  YB 


R 


0  0 

0  S' 


EE 


Bt 


(57) 


where  SEE  are  the  subdomain  Schur  complement  matri¬ 
ces.  A  more  efficient  preconditioner  (but  not  optimal)  is 
obtained  by  skipping  the  Dirichlet  solve  above  and  cal¬ 
culating  the  fluxes  directly  as 


V  s  =  KsEEBsErs 
This  leads  formally  to 

Bsr  (58) 

where  the  Schur  complement  matrices  have  been  approx¬ 
imated  by  their  leading  terms.  The  two  preconditioners 
above  are  called  the  Dirichlet  and  Lumped  precondition¬ 
ers.  All  results  shown  in  this  paper  use  the  Dirichlet 
preconditioner,  even  though  for  3-D  brick  problems  the 
Lumped  preconditioner  is  obviously  more  efficient. 


M' 


-'  =  Em 


0  0 
0  K 


EE 


Matrix-vector  multiplication 

This  is  identical  to  the  residue  calculation,  except 
usR  is  now  calculated  in  Eq.  48  using  BRT  vs  on  the  right 
hand  side,  instead  of  fR  —  BRT Xs.  vs  is  the  subdomain 
restriction  of  a  global  vector  v  that  is  to  be  multiplied. 


Figure  5:  A  4  x  4  plate  partitioning;  16 
elements  in  each  partition;  H  =  1/4,  h  = 
1/16;  Interface  and  corner  nodes  shown 
in  circles  and  squares  respectively. 


Figure  6:  A  8  x  8  plate  partitioning;  16 
elements  in  each  partition;  H  =  1/8,  h  = 
1/32;  Interface  and  corner  nodes  shown 
in  circles  and  squares  respectively. 


Numerical  scalability 

For  a  symmetric  and  coercive  elliptic  operator  the 
condition  number  of  the  preconditioned  FETI-DP  inter¬ 
face  problem  can  be  shown  to  grow  as 


k{S) 


O  (  1  +  log  ^ 


where  m  <  3 


If  the  subdomains  have  size  H,  and  the  finite  element 
mesh  has  size  h,  then  the  condition  number  of  the  in¬ 
terface  does  not  grow  with  the  number  of  subdomains 
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as  long  as  the  mesh  within  each  subdomain  is  refined 
to  keep  H/h  constant.  This  is  the  definition  of  optimal 
numerical  scalability.  Thus,  a  bigger  problem  with  addi¬ 
tional  subdomains  require  the  same  iteration  count  as  a 
smaller  problem. 

The  optimality  of  the  algorithm  is  verified  on  a  can¬ 
tilevered  square  plate  of  unit  dimensions.  Plate  bending, 
like  beam  bending,  is  governed  by  4th  order  partial  differ¬ 
ential  equations  and  is  considered  a  challenge  for  iterative 
solvers  because  their  condition  numbers  grow  at  a  rate 
0(h~4).  Tables  6  and  7  show  that  the  iteration  count 
increases  with  increase  in  H/h  (Table  6),  and  decreases 
with  decrease  in  H/h  (Table  7).  However,  if  H/h  is  held 
fixed,  Table  8  shows,  as  desired,  the  iteration  count  re¬ 
mains  relatively  constant.  Thus,  the  two  plate  problems 
shown  in  Figs.  5  and  6  converge  at  the  same  rate  (see 
rows  marked  with  <=  in  Table  8) —  even  though  the  latter 
is  four  times  larger. 


h 

ns 

nDOF 

FETI-DP 

CG 

FETI-DP 

GMRES 

1/8 

16 

216 

18 

21 

1/12 

16 

468 

23 

26 

1/16 

16 

816 

26 

31 

1/20 

16 

1260 

29 

36 

1/24 

16 

1800 

32 

39 

1/32 

16 

3168 

37 

46 

Table  6:  Iteration  count  vs.  increase  in  mesh 
refinement.  Total  number  of  subdomains  fixed 
?rs  =  16,  i.e.,  H  fixed.  H/h  increased. 


H 

ns 

nDOF 

FETI-DP 

CG 

FETI-DP 

GMRES 

1/3 

9 

1260 

31 

46 

1/4 

16 

1260 

32 

41 

1/6 

36 

1260 

29 

33 

1/8 

64 

1260 

25 

29 

1/12 

144 

1260 

21 

23 

Table  7:  Iteration  count  vs.  increase  in  number 
of  subdomains.  Total  problem  size  or  number  of 
DOFs  fixed  h  =  1/24,  i.e.,  h  fixed.  H/h  decreased. 


h 

ns 

nDOF 

FETI-DP 

CG 

FETI-DP 

GMRES 

1/12 

9 

468 

23 

29 

1/16 

16 

816 

26 

314= 

1/20 

25 

1260 

28 

32 

1/24 

36 

1800 

29 

33 

1/28 

49 

2436 

30 

34 

1/32 

64 

3168 

30 

344= 

Table  8:  Iteration  count  vs.  increase  in  number  of 
subdomains  for  an  increasing  total  problem  size. 
Total  number  of  DOFs  increases,  i.e.,  h  decreases; 
Total  number  of  subdomains  increases,  i.e.,  H  de¬ 
creases;  but  H/h  fixed. 


Parallel  Implementation  of  CG 

A  standard  Conjugate  Gradient  (CG)  update  is  as 
shown  below.  The  main  building  blocks  that  are  con¬ 
structed  using  the  parallel  FETI-DP  procedure  are  high¬ 
lighted  in  bold. 

Ao  =  0;  ro  =  d  —  FAo 

for  k  =  1,2,... 

zk-i  =  M~1rk_1 

£k  =  (zk-Hk-i)  /  (zI-2rk-2)  with  =  0 
pk  =  zk-i  +  (kPk-i  with  pi  =  Zq 

Ik  =  (4-ih-i)  /  (pI  FPk  ) 

A  k  =  Afc_i  +  7  kPk 
rk  =  rk- i  -  7 kFpk 

end 

In  addition  to  the  communication  requirements  for 
the  FETI-DP,  the  CG  update  requires  processor  synchro¬ 
nization  points  of  its  own.  These  are  points  beyond  which 
calculations  cannot  proceed  unless  all  processors  reach 
that  point.  All  vector  inner  products  are  synchroniza¬ 
tion  points.  The  two  synchronization  points  are  under¬ 
lined  above.  An  additional  synchronization  point  is  re¬ 
quired  to  calculate  the  norm  of  the  preconditioned  resid¬ 
ual  |  |zfc— 1 1 1 2 ,  to  determine  the  stopping  criteria.  In  the 
case  of  CG,  the  total  number  of  points  can  be  reduced  to 
one,  using  advanced  norm  estimation  techniques  [26,  27]. 
This  refinement  has  not  been  included  at  present,  but  is 
needed  eventually  when  thousands  of  distributed  mem¬ 
ory  nodes  are  used. 

Parallel  Implementation  of  GMRES 

A  standard  Generalized  Minimum  Residual  (GM¬ 
RES)  update  requires  an  Arnoldi  algorithm  and  a  solu¬ 
tion  of  a  least-square  problem.  The  Arnoldi  algorithm 
implemented  in  this  study  uses  a  Reorthogonalized  Clas¬ 
sical  Gram-Schmidt  procedure,  based  on  Ref.  [28].  This 
procedure  produces  levels  of  orthogonalization  that  is 
superior  to  Modified  Gram-Schmidt  [29],  while  prevent¬ 
ing  the  unacceptable  communication  costs  of  the  latter 
(explained  below).  A  Classical  Gram-Schmidt  (without 
Reorthogonalization)  is  numerically  unstable  and  is  not 
used  in  practice. 

Recall,  given  an  initial  estimate  Xq  and  residual 
ro  =  b  —  Ax o,  every  m-th  GMRES  update  for  the  so¬ 
lution  of  Ax  =  b  is  given  by  xm  =  Xq  +  K  where  K  lies 
in  the  Krylov  subspace  of  dimension  m  associated  with 
A  and  r0,  Km(A,r0)  =  span(r0,  Ar0, . . . ,  Arn~1r0),  and 
minimizes  the  norm  ||6  —  Ae||2. 

The  Arnoldi  algorithm  in  dimension  m  constructs 
an  orthonormal  basis  Vm  =  [iq,  i>2, . . . ,  vm\  of  the  Krylov 
subspace  Km(A,  ro).  The  procedure  also  generates  a  ma¬ 
trix  Hrn  of  size  (to+  1)  x  to  the  top  m  x  to  block  of  which 
is  an  upper  Hessenberg  matrix  Hm.  The  m-th  update  is 
computed  as  xm  =  Xo  +  Vmym  where  yrn  is  calculated 
such  xm  minimizes  ||6  —  Axm\\2-  This  amounts  to  cal- 
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culating  a  yrn  which  minimizes  ||/?ei  —  HmUm II2  where 
/3  =  1 1  r 0 1 1 2  and  e\  is  the  first  canonical  vector  of  5ftm+1. 
A  QR  factorization  —  employing  Givens  rotations  —  is 
used  here  to  solve  this  least  squares  problem. 

The  classical  GMRES  method  expands  the  Krylov 
subspace  dimension  to  n  and  terminates  in  at  most  n 
iterations,  where  n  is  the  size  of  A.  However,  each  itera¬ 
tion  requires  every  one  of  the  previous  basis  vectors,  and 
hence  the  method  is  expensive  in  terms  of  memory.  A 
restarted  version  of  GMRES  restricts  the  expansion  to, 
say,  m  dimensions  and  restarts  the  Arnoldi  algorithm  us¬ 
ing  xm  as  it  new  initial  guess.  These  restarts  are  called 
the  outer  iterations.  We  use  a  restarted  GMRES(m) 
method.  It  is  as  follows. 

Ao  =  0;  ro  =  d  —  FAo 

z0  =  M-1r0 

P  =  1 1  1 1 2 

for  k  =  1,2,...  till  convergence 
Vi  =  Zk-l/P 
Arnoldi  algorithm 
Least-square  solve  of  order  m: 

Calculate  ym  to  minimize 
min  \\Pei  -Hmy\\2._ 

Use  QR  factorization  of  Hm. 

Afc  —  Afc_i  T  Vrnym 
rk  =  d  -  FAk 
zk  =  M_1rk 

P  =  WZk\\2 

end 

The  Arnoldi  algorithm  consists  of  three  steps.  Given 
A  and  an  initial  vector  v\,  the  m  orthonormal  basis  vec¬ 
tors  are  constructed  as  follows. 

for  j  =  1, 2, . . . ,  m 

1.  Basis  expansion:  Wj+ 1  =  Avj 

2.  Orthogonalization:  orthogonalize  'Wj+i 
with  respect  to  all  previous  Arnoldi 
vectors  (vi,v2 , ,Vj) 

3.  Normalization:  hj+ ij  =  ||wj+1||2 
and  Vj. |_i  =  Wj+i/hj+ij. 

The  orthogonalization  is  the  main  step.  Tradition¬ 
ally,  a  Modified  Gram-Schmidt  procedure  is  preferred  in 
this  step  because  of  its  numerical  stability  over  Classical 
Gram-Schmidt.  It  is  as  follows  —  with  the  synchroniza¬ 
tion  points  underlined. 

for  j  =  1, . . . ,  to 
w  =  Fvj 
t  =  M^w 
for  i  =  1,. .. , j 
hitj  =  vjt 
t  —  t 
end 

hj+i,j  =  \\t\\2 

vj+i  =  t/hj+ij 

end 


A  complication  is  immediately  apparent  -  the  first 
set  of  points,  i.e.  the  vft  calculations,  presents  a  high 
communication  requirement.  Within  each  step  j,  the 
vector  t,  once  generated,  is  immediately  projected  to, 
and  subtracted  from,  each  and  every  one  of  the  previ¬ 
ous  Arnoldi  vectors  .  Each  projection,  a  vector  inner 
product,  requires  a  global  synchronization.  In  a  Clas¬ 
sical  Gram-Schmidt,  the  projection  and  the  subtraction 
steps  could  be  carried  out  separately,  with  a  single  syn¬ 
chronization  step  in-between.  However,  because  Classi¬ 
cal  Gram-Schmidt  is  unstable  (though  mathematically 
equivalent  to  Modified  Gram-Schmidt) ,  a  second  orthog¬ 
onalization  step  is  needed.  Thus,  the  final  Reorthogonal- 
ized  Classical  Gram-Schmidt  algorithm  is  as  follows. 

for  j  =  1, . . . ,  m 
w  =  Fvj 
t  =  M^w 

for  i  =  1,. . . , j 

K 3  =  r!  1 

end 

Global  synchronization  1 
for  i  =  1,. . . , j 
t  —  t  hijVj 
end 

for  i  =  1, . . . ,  j 
K,j  =  vft 
end 

Global  synchronization  2 
for  i  =  1,. . .  ,j 

t  =  t-  h'ijVj 

end 

h  =  h  +  h' 
hj+i,j  =  \\t\ \_2 
vj+ 1  =  t/hj+ij 
end 


3-D  FEM  ROTOR  ANALYSIS  COMPONENTS 

The  main  components  of  the  3-D  rotor  FEM  anal¬ 
ysis  are  described  in  this  section.  They  are:  geometry 
and  grids,  partition  and  corner  selection,  steady  hover 
prototype,  and  the  transient  forward  flight  prototype. 

These  are  mere  prototypes  because  we  do  not  use 
real  airloads,  do  not  have  a  trim  mechanism,  and  do  not 
have  at  present  a  true  representation  of  a  blade  structure. 
In  addition,  for  research  purposes,  we  will  consider  only 
small  size  problems,  with  which  a  large  number  of  cases 
could  be  constructed  using  the  48  processors  that  were 
at  our  disposal. 

On  the  other  hand,  every  fundamental  aspect  of  the 
physics  of  the  structural  dynamics  of  an  isolated  rotor 
blade  is  incorporated.  And,  the  parallel  solution  proce¬ 
dure  is  generic,  i.e.  it  is  independent  of  the  type  of  air- 
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tions  (Table  9).  As  before,  n\  x  ri2  x  n3  refers  to  numbers 
of  elements  along  span,  chord,  and  thickness.  Note  that 
each  element  contains  81  degrees  of  freedom  and  64  in¬ 
tegration  points. 

0.1  R0.1  R 


0.05  r  •*-» 


Grid 

ni  x  ri2 

X 

n3 

Total 

DOFs 

1 

48 

x  4 

X 

2 

12. 

,960 

2 

48 

x  4 

X 

4 

25. 

,920 

3 

96 

x  4 

X 

2 

25. 

,920 

4 

96 

x  4 

X 

4 

46. 

,656 

Table  9:  3-D  FEM  rotor  grids 


Figure  7:  Planform  of  a  hingeless  rotor  blade 
showing  two  different  span-wise  grid  resolu¬ 
tions  used  in  this  study;  c  =  0.53  m 


Figure  8:  Cross-section  of  prototype  rotor 
blade;  4x4  bricks  with  internal  nodes;  5% 
t/c,  exaggerated  scale. 

loads,  control  angle  variation,  grid,  and  material  consti¬ 
tution.  The  objective  is  to  study  the  parallel  scalability 
of  the  Newton-Krylov  solver. 

Geometry  and  Grid 

We  consider  a  hingeless  rotor  blade,  discretized  as 
shown  in  Figs.  7  and  8.  The  simple  grid  generator 
for  this  study  requires  that  the  cross-sectional  discretiza¬ 
tion  remains  the  same  along  span  and  that  all  sections 
be  solid.  With  these  assumptions,  it  is  straight  forward 
to  accommodate  an  arbitrary  variation  of  airfoil  shape, 
twist,  planform,  and  advanced  tips.  A  key  limitation  at 
present  is  that  only  one  continuous  structure  can  be  grid- 
ded.  Grid  generation,  however,  is  not  the  focus  of  this 
work.  It  is  assumed  that  a  suitable  grid  will  be  available 
to  the  solver  from  other  sources. 

The  surface  geometry,  required  for  external  forcing, 
is  defined  by  the  sectional  airfoil  coordinates.  We  use  a 
generic,  symmetric  airfoil  with  5%  thickness  (Fig.  8). 

We  consider  a  set  of  four  finite  element  discretiza- 


Each  finite  element,  naturally,  can  accommodate  its 
own  constitutive  material  model  and  ply  direction  -  we 
use  simple  isotropic  properties:  E  =  73  GPa;  v  =  0.3; 
and  p  =  2700  kg/m3  (corresponding  to  Aluminum). 

Along  with  the  dimension  c  =  0.53  m,  these  gen¬ 
erate  similar  order  of  magnitude  non-dimensional  values 
of  stiffness  and  inertia  as  soft  in-plane  hingeless  rotors. 
No  attempt  is  made  to  place  the  sectional  offsets  with 
respect  to  quarter-chord.  Thus,  the  blade  may  not  be 
dynamically  stable.  However,  the  values  will  generate 
typical  deflections  with  typical  airloads.  The  airloads  are 
an  uniform  4000  N/m2  baseline  (around  375  lb/ft  along 
span),  and  two  and  four  times  that  magnitude,  to  gener¬ 
ate  moderately  large  deformations.  These  are  referred  to 
as  airloads  1,2,  and  3.  The  pressure  airloads  act  only  on 
the  top  surface  and  have  the  non-linear  characteristics  of 
a  follower  force  -  i.e.,  they  act  normal  to  the  deformed 
surface  which  is  not  know  a  priori.  The  rotational  speed 
is  H  =  27  rad/s  (steady). 

Grid  Partitioning  and  Corner  Selection 

The  partitioning  requirements  are  unique.  The  gen¬ 
eration  of  subdomain  grids  from  a  global  grid  via  a  re¬ 
calculation  of  the  finite  element  connectivity  is  straight¬ 
forward.  Partitioners  are  widely  available  in  public  do¬ 
main,  that  carry  out  this  task  ensuring  an  optimal  bal¬ 
ancing  of  processor  loads.  However,  for  structures,  this 
is  not  the  most  important  requirement.  The  most  im¬ 
portant  requirement  is  that  the  the  coarse  problem  be 
picked  to  ensure  a  null  kernel  in  every  substructure,  i.e. 
Ksrr  be  invertible. 

The  partitioner  we  develop  as  part  of  this  research  is 
simple  -  in  that  it  handles  only  the  brick  elements  we  de¬ 
veloped,  and  it  makes  the  same  assumptions  on  grid  type 
as  our  simple  grid  generator.  Namely,  the  cross-sectional 
grids  must  remain  the  same  throughout  the  span,  regard¬ 
less  of  variations  in  geometry. 

Figure  9(a)  shows  a  generic  2-D  partition  that  is 
used  in  the  present  study.  The  blade  can  be  divided 
into  any  number  of  substructures  in  the  span-wise  and 
chord-wise  directions.  Figure  9(b)  shows  an  alternative 
1-D  partition.  It  will  be  shown  that  the  1-D  partition, 
though  naturally  load  balanced,  is  a  poor  partition  and 
should  not  be  used. 

The  partitioner  performs  the  following  tasks: 


15 


(a)  2-D  partitioned  blade  grid 


(b)  1-D  partitioned  blade  grid 


Figure  9:  2-D  and  1-D  partitioning  of  blade  grid  into  substructures.  Each  substructure  is  solved  in  a 
separate  processor. 


1.  Designates  the  comer  nodes. 

2.  Re-orders  the  subdomain  nodes  into  interior,  face, 
edge,  vertex,  and  boundary  nodes.  Re-calculates  the 
subdomain  finite  element  connectivity. 

3.  Sets  up  domain  connectivity  maps  for  substructure 
to  substructure  communication. 

The  first  and  second  are  the  key  tasks.  The  third  is 
merely  a  matter  of  book-keeping. 

Corner  Selection 

Consider  the  2-D  partition  of  Fig.  9(a).  The  nodes 
on  the  subdomain  edges  —  that  are  common  to  more 
than  two  subdomains  —  are  immediately  designated  as 
corner  nodes.  It  is  clear,  however,  that  this  definition 
makes  the  two  substructures  at  the  tip  end  (or  extrem¬ 
ities  of  the  tip  end  in  case  of  more  than  two  chord-wise 
strips)  indefinite.  Each  substructure  then  carries  a  ro¬ 
tational  rigid  body  mode  making  KSRR  non-invertible. 
Thus,  the  definition  of  corner  nodes  must  include,  in  ad¬ 
dition  those  edge  nodes,  which  may  be  common  only  to 
two  subdomains,  but  which  occur  at  the  boundaries  of 
the  structure.  With  this  definition,  the  corner  nodes  of 
a  typical  substructure  are  now  as  shown  in  Fig.  10.  This 
definition  also  enables  the  selection  of  corner  nodes  for  a 


typical  substructure  of  the  1-D  partition  (Fig.  11),  oth¬ 
erwise,  there  would  be  no  corner  nodes.  For  a  large  scale 
3-D  problem,  a  large  number  of  corner  nodes  is  generated 
by  this  procedure.  A  superior  choice  is  simply  the  sub- 
domain  vertices,  and  like  before,  additionally  those  that 
occur  at  the  boundaries  (Fig.  12).  There  is  then  always 
a  maximum  of  only  8  corner  nodes  per  subdomain  re¬ 
gardless  of  the  grid.  In  this  paper,  this  selection  is  not 
implemented.  We  use  the  previous  selection  as  shown  in 
Fig.  10. 


Node  Reorder 

The  re-ordering  brings  the  interior  nodes  first,  fol¬ 
lowed  by  interface  nodes,  then  corner  nodes,  and  lastly 
the  boundary  nodes.  The  procedure  depends  on  the  grid, 
partition  (1-D  or  2-D),  and  the  selection  of  corner  nodes. 
The  N  elemental  nodes  in  each  brick  is  associated  with 
a  natural  order  within  each  substructure.  The  natural 
order  is  then  associated  with  a  reordered  order,  and  a 
reverse  association  back  to  natural.  In  addition,  the  nat¬ 
ural  order  is  associated  with  the  global  order  because 
the  geometry  and  material  constitution  is  defined  in  the 
latter. 
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Boundary  nodes 


Figure  10:  Node  designation  for  a  typical  sub¬ 
domain  from  a  2-D  partition.  Corner  nodes 
are  edge  nodes  connecting  more  than  two  sub¬ 
structures  and  those  occurring  at  the  bound¬ 
aries. 


Figure  11:  Node  designation  for  a  typical  sub- 
domain  from  a  1-D  partition. 


Domain  Connectivity 

For  a  Lagrangian  formulation,  the  connectivity  re¬ 
mains  static,  and  needs  to  be  calculated  only  once  for 
a  grid.  The  non-floating  and  non-overlapping  nature  of 
the  partitions,  and  the  conforming  nature  of  the  finite 
elements  lead  to  no  search,  interpolation,  or  projection 
requirements.  Consequently,  there  are  no  errors  intro¬ 
duced  during  substructure  to  substructure  communica¬ 
tion.  Each  substructure  carries  a  destination  map  and  a 
reception  map.  The  destination  map  contains  the  sub¬ 
structures  to  which  quantities  are  to  be  dispatched,  and 
the  node  numbers  to  which  they  correspond.  The  recep¬ 
tion  maps  contains  the  substructures  from  which  quan¬ 
tities  are  to  be  received,  and  the  internal  node  numbers 
to  which  they  will  correspond. 


Figure  12:  Optimal  Node  designation  for  a  typ¬ 
ical  subdomain  from  a  2-D  partition.  Corner 
nodes  are  vertex  nodes  only  including  those 
occurring  at  the  boundaries. 

Steady  Hover  Prototype 

The  steady  hover  prototype  simply  solves  for  blade 
response  using  a  prescribed  pressure  airload  at  a  constant 
collective  pitch  angle.  The  non-linear  solution  procedure 
uses  Newton-Raphson  outer  iterations.  The  FETI-DP 
inner  solver  uses  CG  update  in  hover.  This  is  adequate 
as  the  stiffness  matrix  is  symmetric. 

Several  initial  updates  of  the  stiffness  matrix  are 
necessary  to  include  the  non-linear  structural  stiffness  - 
which  provides  the  key  extension-bending  non-linearity 
associated  with  rotation.  Figure  13  shows  the  conver¬ 
gence  of  this  non-linearity  in  the  left  hand  side  of  the 
figure.  Around  8  to  10  iterations  are  required.  Subse¬ 
quently,  the  stiffness  matrices  can  be  updated  only  at 
certain  intervals  if  desired,  while  using  modified  New¬ 
ton  iterations  in  between.  Once  the  structural  non- 
linearities  are  converged,  the  pressure  airloads  are  im¬ 
posed  on  the  blade.  The  direction  of  the  pressure  airloads 
is  deformation-dependent  and  unknown  a  priori.  Thus, 
equilibrium  iterations  are  required.  The  convergence  of 
the  airload  iterations  are  shown  in  the  same  plot  on  the 
right  hand  side  (corresponding  to  airloads  1,  2,  and  3). 
The  two  parts  are  shown  separately  only  because  the  lat¬ 
ter  are  equilibrium  iterations  where  the  geometric  stiff¬ 
ness  need  not  be  updated.  The  results  shown,  however, 
are  with  fully  updated  stiffness  in  every  iteration. 

In  each  iteration,  the  virtual  work  is  calculated  based 
on  the  previous  iteration  deformation  state.  An  alter¬ 
native  and  more  rigorous  approach  is  to  linearized  the 
forcing  using  incremental  displacements.  This  leads  to 
a  non-symmetric  stiffness  contribution,  which  is  however 
easily  handled  by  replacing  the  geometric  stiffness  I\  with 
1/2(K+ Kt),  since  the  role  of  the  stiffness  is  only  to  con¬ 
verge  the  Newton  iterations.  However,  iterations  are  still 
necessary  and  therefore  we  believe  the  previous  configu¬ 
ration  approach  is  more  efficient.  The  relatively  large  de- 
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Figure  13:  Convergence  of  Newton-Raphson 
outer  iterations  for  rotational  and  forcing 
non-linearities  in  hover 


Figure  15:  Convergence  pattern  of  FETI-DP 
(CG)  updates  in  steady  hover  calculations  on 
4  to  48  processors. 


Figure  14:  Blade  steady  deflection  in  hover  us¬ 
ing  prescribed  pressure  airloads  3  (only  grid 
outlines  are  shown) 


Figure  16:  Convergence  of  FETI-DP  (GMRES) 
updates  for  various  restart  parameters  in  for¬ 
ward  flight;  all  calculations  are  with  32  pro¬ 
cessors. 
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formation  corresponding  to  airloads  3  is  shown  in  Fig.  14. 

The  convergence  criteria  of  the  inner  Krylov  solver 
is  set  tightly  to  10“ 12  for  all  cases  in  this  study.  Fig.  15 
shows  the  convergence  of  the  solver  when  run  on  4  to  48 
processors.  The  convergence  corresponding  to  the  first 
Newton  iteration  is  shown,  as  it  begins  with  zero  guess 
and  takes  the  most  number  of  iterations.  The  parallel 
scalability  of  these  calculations  are  examined  in  more  de¬ 
tail  in  the  next  section.  Here,  we  note  that  the  iteration 
count  shows  an  initial  increase  with  processors.  This  is 
contrary  to  Table  7  but  is  recognized  for  3-D  brick  prob¬ 
lems  in  Ref.  [16].  The  remedy  is  an  edge-based  augmen¬ 
tation  procedure  that  is  not  incorporated  in  this  work. 

Transient  Forward  Flight  Prototype 

The  transient  forward  flight  prototype  solves  for 
blade  response  using  the  same  set  of  prescribed  pressure 
airloads  as  in  hover,  but  now  the  stiffness  and  forcing  val¬ 
ues  are  those  that  arise  out  of  a  single  time  step  of  a  time¬ 
marching  procedure.  We  consider  a  Newmark  scheme 
with  a  5°  azimuth  step.  The  control  angle  variation  is 
taken  as  9  (ip)  =  20°  +  5°  cos  ip  —  5°  sin  ip.  The  dynamic 
stiffness  now  contains  the  complete  non-symmetric  in¬ 
ertial  terms.  Therefore,  the  FETI-DP  inner  solver  now 
uses  a  GMRES  update. 

The  matrix  structure  will  be  identical  in  every  time 
step,  therefore,  for  purposes  of  scalability  it  is  enough  to 
study  only  one  time  step.  Unlike  CG  where  each  update 
requires  only  two  previous  solutions,  in  GMRES,  each  up¬ 
date  requires  all  of  the  previous  updates.  The  restarted 
version,  as  explained  earlier,  uses  at  the  most  m  updates 
at  a  time,  starting  afresh  with  a  superior  initial  guess 
each  time. 

Figure  16  shows,  that  for  the  small  grid  size  used 
here  (i.e.  a  small  bandwidth),  the  convergence  pattern 
remains  similar  over  a  wide  range  of  restart  parameters 
-  even  m  =  5  is  adequate.  For  purposes  of  a  realistic 
scalability  study,  however,  we  will  consider  m  =  30,  40, 
and  50.  These  are  deemed  more  suitable  for  large  scale 
problem  sizes. 

SCALABILITY  OF  3-D  ROTOR  ANALYSIS 

The  parallel  scalability  of  the  3-D  FEM  solver  is  doc¬ 
umented  here  in  detail.  The  first  section  deals  with  the 
steady  hover  prototype.  The  Krylov  solver  uses  a  CG 
update  here.  The  idea  of  substructure  optimality  and 
the  definition  of  scalability  are  introduced.  The  second 
section  deals  with  the  transient  forward  flight  prototype. 
The  Krylov  solver  is  equipped  with  a  GMRES  update 
here. 

Steady  Hover 

Consider  the  grid  of  size  96  x  4  x  2.  It  is  partitioned 
into  6  to  48  substructures  using  a  2-D  partitioning,  i.e., 
having  an  arrangement  as  Fig.  9(a)  with  3  x  2  to  24  x  2 
subdomains. 


ns 

FE 

Subdomain 

LU 

Coarse 

problem 

FETI-DP 

CG 

Solver 

total 

6 

201 

757 

130 

775 

1668 

8 

200 

463 

91 

622 

1180 

12 

199 

246 

63 

458 

770 

16 

194 

165 

52 

361 

580 

24 

191 

96 

51 

267 

416 

32 

190 

66 

71 

213 

350 

48 

190 

39 

168 

187 

395 

Table  10:  Solver  time  (secs)  vs.  number  of  sub¬ 
structures  ns  on  a  single  processor 


np 

FE 

Subdomain 

LU 

Coarse 

problem 

FETI-DP 

CG 

Solver 

total 

6 

32.6 

121.80 

21.56 

94.07 

237.87 

8 

24.2 

57.44 

12.33 

57.23 

127.28 

12 

16.2 

20.95 

5.80 

26.85 

53.73 

16 

12.6 

10.54 

3.64 

14.93 

29.19 

24 

8.18 

4.16 

2.71 

7.96 

14.89 

32 

6.01 

2.20 

2.92 

5.70 

10.85 

48 

4.24 

0.88 

5.62 

5.15 

11.70 

Table  11:  Solver  time  (secs)  vs.  number  of  proces¬ 
sors  np ;  each  processor  contains  one  substructure 

The  dramatic  reduction  in  solution  time  is  clear  from 
Tables  10  and  11.  The  details  are  described  later.  Here, 
we  note  that  a  direct  solution  of  this  problem  takes  more 
than  50, 000s.  The  iterative  substructuring  method,  with 
32  substructures,  but  still  on  a  single  processor,  brings  it 
down  to  only  350s  (Table  10).  In  addition,  the  method  is 
now  fully  parallel.  Therefore,  a  scalable  implementation, 
on  32  processors,  finally  brings  it  down  drastically  to 
10.85s  (Table  11). 

Consider  the  solver  times  for  this  problem  on  a  single 
processor  (Fig.  17).  The  importance  of  substructuring  is 
immediately  apparent.  There  is  a  steep  drop  in  solu¬ 
tion  time  with  increasing  number  of  substructures.  For 
a  problem  of  fixed  size,  a  condition  of  diminishing  return 
must  eventually  be  reached,  with  an  optimal  number  of 
substructures  producing  the  minimum  solution  time.  We 
shall  call  this  the  substructure  optimality  number.  For 
this  problem  it  is  32.  Note,  however,  that  the  rise  in  so¬ 
lution  time  beyond  the  optimality  point  is  not  nearly  as 
steep  as  its  decline  prior  to  it,  and  there  is  a  large  region 
over  which  it  remains  flat.  For  this  problem,  this  region 
is  roughly  from  16  to  48  substructures.  This  flat  region 
is  a  gift  of  iterative  substructuring.  It  is  shown  later  that 
this  region  is  sensitive  to  the  partitioning  and  corner  se¬ 
lection  scheme.  A  good  partitioner  and  corner  selector 
will  keep  this  region  flat,  a  poor  one  will  produce  a  steep 
rise. 

The  parallel  implementation  solves  each  substruc¬ 
ture  on  a  separate  processor.  To  calculate  parallel  speed¬ 
up,  it  is  important  to  use  the  single  processor  time  with 
the  same  number  of  substructures  as  the  baseline.  That 
is,  the  solver  time  with,  say  32  processors,  must  be  com- 
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Figure  17:  Solver  time  vs.  number  of  substruc¬ 
tures  for  calculations  on  a  single  processor. 
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Figure  18:  Parallel  speed-up  for  calculations  on 
multiple  processors;  each  substructure  solved 
in  a  separate  processor. 
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Figure  19:  Effect  of  grid  partitioning  and  cor¬ 
ner  selection  on  solver  time  as  a  function  of 
number  of  substructures;  calculations  on  a 
single  processor. 
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Figure  20:  Effect  of  grid  partitioning  and  cor¬ 
ner  selection  on  parallel  speed-up  for  calcula¬ 
tions  on  multiple  processors. 
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Figure  21:  Two  different  problem  sizes 
with  the  same  substructure  optimality; 
solver  time  vs.  number  of  substructures 
on  a  single  processor. 


Figure  22:  Parallel  speed-up  for  problem 
sizes  with  the  same  substructure  opti¬ 
mality;  solver  time  vs.  number  of  pro¬ 
cessors. 


pared  with  the  corresponding  solver  time  on  a  single  pro¬ 
cessor  that  uses  32  substructures.  This  is  to  ensure  that 
computations  of  the  same  complexity  are  compared,  oth¬ 
erwise,  the  speed-up  is  contaminated  with  the  benefits  of 
substructuring  and  a  super-linear  number  is  always  ob¬ 
tained.  This  is  because  using  32  substructures  on  a  sin¬ 
gle  processor  by  itself  reduces  the  solver  time  by  more 
than  32  -  a  fact  that  has  nothing  to  do  with  paralleliza¬ 
tion  but  substructuring  itself.  The  parallel  speed-up  is 
shown  in  Fig.  18.  Even  for  this  fixed  problem  size,  it 
has  a  perfectly  linear  trend  up  to  the  point  of  substruc¬ 
ture  optimality.  Note  that  the  point  corresponding  to  32 
processors  has  nothing  to  do  with  the  point  correspond¬ 
ing  to  16  processors.  Indeed,  the  32  processor  run  takes 
only  10.9s  whereas  the  16  processor  run  takes  29.2s,  i.e. 
less  than  half  the  time,  due  to  the  dramatic  reduction  in 
subdomain  LU  time  (see  Table  10).  What  the  speed-up 
plot  shows  is  that  32  processors  run  the  problem  exactly 
32  times  faster  compared  to  a  single  processor  using  32 
substructures. 

The  drop  off  in  scalability  beyond  32  processors  is 
studied  using  the  detailed  timings  for  the  different  parts 
of  the  computation.  The  timings  for  the  single  proces¬ 
sor  and  parallel  calculations  are  given  in  Tables  10  and 
11.  In  the  tables,  ‘FE’  refers  to  the  time  taken  to  con¬ 
struct  the  structural  matrices.  ‘Solver  total’  refers  to 
the  total  solver  time.  The  two  together  constitute  the 
total  simulation  time.  ‘Solver  total’  consists  of  three 
parts:  (1)  ‘Subdomain  LU’  time,  which  refers  to  the  sub- 
domain  factorization,  (2)  ‘Coarse  problem’  time,  which 
refers  to  the  coarse  problem  factorization  and  communi¬ 
cation,  and  (3)  the  ‘FETI-DP’  time,  refers  to  the  Krylov 
solver  time.  Note  that  the  later  includes  the  computation 
and  communication  costs  of  the  residual,  preconditioner, 
and  matrix- vector  multiplies,  and  the  additional  commu¬ 
nications  required  for  the  updates.  The  communications 


costs  are,  of  course,  incurred  only  during  the  parallel  cal¬ 
culations. 

From  Table  10,  which  shows  the  single  processor 
timings,  the  reason  behind  the  flat  region  in  Fig.  17  is 
clear.  The  growth  in  the  coarse  problem  is  offset  by  the 
reduction  in  the  Krylov  solver  time.  This  is  expected 
-  as  the  purpose  of  the  coarse  problem  is  precisely  that 
but  the  key  point  is  that  the  coarse  solver  should  be 
just  enough  to  serve  this  purpose  and  no  larger,  so  that 
the  substructure  optimality  is  pushed  to  as  high  a  pro¬ 
cessor  number  as  possible.  Beyond  the  optimality  point, 
any  growth  in  the  coarse  problem  is  an  indicator  of  in¬ 
creased  communication  cost  for  the  parallel  implemen¬ 
tation.  Note  that  the  coarse  problem  is  solved  in  every 
processor  and  as  such,  requires  a  global  communication. 
The  drop  off  in  Fig.  18  is  a  direct  consequence  of  this 
communication  cost.  To  summarize,  a  key  objective  of 
the  coarse  problem  selection  should  be  to  suppress  the 
growth  beyond  substructure  optimality  to  as  gradual  as 
possible.  This  has  little  bearing  upon  scalability  with 
problem  size,  but  serves  to  extend  linear  speed-up  for 
a  fixed  problem  size,  to  as  high  a  processor  number  as 
possible. 

We  illustrate  the  importance  of  the  coarse  problem 
with  a  worse  partitioning.  The  same  problem,  when 
treated  with  a  1-D  partitioning  as  illustrated  earlier  in 
Figs.  9(b)  and  11,  generates  a  timing  and  scalability  plot 
as  shown  in  Figs.  19  and  20.  The  2-D  partitioning  results 
are  also  plotted  for  comparison.  Clearly,  from  Fig.  19, 
the  same  problem  now  has  a  substructure  optimality  of 
16,  as  opposed  to  32.  A  good  parallel  implementation 
should  guarantee  a  linear  speed-up  to  at  least  16  proces¬ 
sors.  Scalability  beyond  this  number  is  expected  to  be 
affected  adversely  by  communication  costs  of  the  coarse 
problem.  This  is  exactly  what  is  observed  in  Fig.  20. 

Thus,  the  linear  speed-up  range  is  not  a  function 
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Figure  23:  Two  different  problem  sizes 
with  the  same  substructure  optimality; 
solver  time  vs.  number  of  substructures 
on  a  single  processor. 
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Figure  24:  Parallel  speed-up  for  problem 
sizes  of  same  substructure  optimality. 


of  problem  size  alone  but  also  of  substructure  optimal¬ 
ity.  For  example,  Figs.  21  and  22  compare  the  single 
processor  solution  time  and  parallel  speed-up  of  a  bigger 
problem  of  size  96  x  4  x  4.  From  the  single  processor 
solution  time,  it  is  clear  that  the  substructure  optimality 
is  still  32.  As  a  result,  the  linear  speed-up  range  still  ex¬ 
tends  only  up  to  32.  The  same  conclusions  hold  for  very 
small  problem  sizes,  as  shown  in  Figs.  23  and  24.  The 
smallest  grid  of  48  x  4  x  2  shows  a  linear  speed-up  up  to 
24  processors  -  it’s  substructure  optimality  -  in  exactly 
the  same  manner  as  a  grid  of  48  x  4  x  4  twice  its  size. 

In  summary,  given  a  problem  size,  the  solver  shows 
a  linear  speed-up  -  for  at  least  as  many  processors  as  its 
substructure  optimality.  To  extend  this  linear  speed-up 
range,  a  smaller  coarse  problem  is  required.  An  example 
of  such  a  selection  was  shown  earlier  in  Fig.  12. 

The  scalability  with  increasing  problem  size  is  il¬ 


Figure  25:  Solver  time  vs.  number  of  sub¬ 
structures  on  a  single  processor;  FETI- 
DP  (GMRES). 


np 

Problem  1 
48  x  4  x  2 

Problem  2 
96  x  4  x  2 

Problem  3 
48  x  4  x  4 

6 

51.52 

237.87 

232.00 

8 

28.39 

127.28 

126.70 

12 

13.07 

53.73 

55.43 

16 

8.22 

29.19 

32.72 

24 

5.73 

14.89 

21.69 

32 

5.70 

10.85 

22.28 

48 

9.62 

11.70 

38.89 

Table  12:  Parallel  solver  times  (secs)  showing  seal- 
ability  with  respect  to  problem  size  up  to  limits 
of  substructure  optimality. 


lustrated  in  Table  12.  Problem  2  has  twice  the  size  of 
Problem  1  (from  Table  9).  Problem  3  has  the  same  size 
as  Problem  2,  only  different  grid  characteristics.  Prob¬ 
lem  2  and  3,  therefore,  have  similar  solver  times,  up  to 
substructure  optimality  which  sets  in  at  24  processors 
for  Problem  3.  The  timings  of  Problem  1  and  Problem  2 
provide  a  simple  illustration  of  scalability  with  increasing 
size.  Because  Problem  2  is  twice  the  size,  twice  the  num¬ 
ber  of  processors  provide  approximately  the  same  solve 
time.  For  example,  Problem  2  on  12  processors  take  sim¬ 
ilar  time  as  Problem  1  on  6.  Problem  2  on  16  take  similar 
time  as  Problem  1  on  8. 

Transient  Forward  Flight 

The  conclusions  drawn  on  effective  partitioning  and 
substructure  optimality  in  the  previous  section  are  car¬ 
ried  over  to  this  section.  These  results,  which  are  similar 
to  those  shown  in  hover,  are  not  repeated  here.  The 
scalability  results  for  a  single  grid  size  96  x  4  x  2  (with 
substructure  optimality  of  32)  is  presented.  The  results 
for  the  other  grids  compare  similarly  to  hover. 

The  scalability  of  the  parallel  FETI-DP  (GMRES) 
solver  involving  the  non-symmetric  matrices  in  forward 
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Figure  26:  Parallel  speed-up  of  FETI-DP 
(GMRES)  solver  using  Classical  Gram- 
Schmidt  with  Reorthogonalization  based 
Arnoldi  procedure. 
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Figure  27:  Parallel  speed-up  of  FETI-DP 
(GMRES)  solver  using  Modified  Gram- 
Schmidt  based  Arnoldi  procedure. 


CONCLUDING  OBSERVATIONS 


np 

Modified  GS 

Classical  GS 
with  Reorth. 

6 

194.08 

190.17 

8 

109.47 

100.47 

12 

41.69 

41.43 

16 

23.92 

24.24 

24 

12.33 

11.77 

32 

8.88 

8.61 

48 

10.67 

10.04 

Table  13:  Parallel  FETI-DP  (GMRES)  solver 
times  (secs)  with  Modified  Gram-Schmidt  and 
Classical  Gram-Schmidt  with  Reorthogonaliza¬ 
tion  based  Arnoldi  procedures. 


flight  shows  a  similar  linear  trend  as  those  of  the  FETI- 
DP  (CG)  solver  in  hover.  The  single  processor  timings 
are  shown  in  Fig.  25.  All  calculations  use  the  Reorthogo- 
nalized  Classical  Gram-Schmidt  Arnoldi  procedure.  The 
increasing  restart  parameters  all  show  the  same  timings 
on  a  single  processor  -  as  expected,  because  their  affect 
is  only  on  memory  ■  but  incur  increasing  communication 
costs  for  a  parallel  calculation.  However,  for  the  small 
sized  problems  considered  here,  there  are  no  discernible 
differences  between  the  three  versions  even  in  parallel.  As 
a  result,  the  scalability  plot  shown  in  Fig.  26  is  identical 
for  all  three  cases.  Indeed,  even  the  Modified  Gram- 
Schmidt  procedure  show  the  same  scalability  behavior 
(Fig.  27).  The  latter  is  expected  to  be  drastically  inferior 
for  large  subdomain  problems.  Note  however,  regardless 
of  scalability,  the  actual  solution  times  for  Reorthogo- 
nalized  Classical  Gram-Schmidt  are  by  themselves  lower 
compared  to  the  Modified  Gram-Schmidt.  This  differ¬ 
ence  is  expected  to  be  drastic  for  large  scale  problems. 
This  trend  is  clear  in  Table  13. 


The  main  objective  of  this  paper  was  to  demon¬ 
strate  a  parallel  and  scalable  solution  procedure  for  a  3-D 
FEM  based  dynamic  analysis  of  helicopter  rotor  blades. 
The  3-D  FEM  analysis  was  formulated  with  an  empha¬ 
sis  on  the  non-linear  structural,  and  inertial  terms  that 
are  unique  to  rotorcraft.  Second  order,  isoparametric, 
hexahedral  brick  elements,  that  are  well  suited  to  dis¬ 
cretize  a  rotor  blade  structure,  were  developed  to  carry 
out  this  study.  A  dual-primal  iterative  substructuring 
based  Krylov  solver  was  developed  for  a  fully  parallel  so¬ 
lution  procedure.  The  iterative  substructuring  method 
was  built  upon  the  FETI-DP  domain  decomposition  al¬ 
gorithm.  The  Krylov  solver  was  equipped  with  GMRES 
updates,  in  addition  to  its  traditional  CG  update,  due  to 
the  the  non-symmetric  nature  of  the  inertial  terms.  A 
detailed  study  was  carried  out  on  the  scalability  of  the 
solution  procedure  for  both  hover  and  transient  forward 
flight  conditions.  Based  on  this  study,  the  following  key 
conclusions  are  drawn. 

Key  conclusions 

1.  A  3-D  FEM  based  rotor  dynamic  analysis  can  be 
carried  out  using  a  fully  parallel  and  scalable  solu¬ 
tion  procedure. 

2.  Given  a  fixed  problem  size,  there  is  always  an  op¬ 
timal  number  of  substructures  into  which  it  can  be 
decomposed  -  termed  substructure  optimality  -  so 
as  to  require  the  minimum  solution  time. 

3.  The  3-D  FEM  analysis  presented  in  this  paper  is 
scalable,  i.e.  shows  a  linear  speed-up  up  to  the  point 
of  substructure  optimality.  A  p-processor  calcula¬ 
tion  with  a  separate  substructure  in  each  processor 
takes  1/p  the  time  compared  to  a  single  processor 
with  p  substructures.  It  also  scales  with  problem 
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size.  That  is,  a  n-times  larger  problem  takes  similar 
time  with  n  x  p  processors. 

4.  For  a  fixed  problem  size,  a  drop  off  in  scalability 
eventually  occurs  -  but  not  before  the  subdomain 
optimality  number  is  reached.  At  that  point,  there 
is  no  reason  to  use  any  more  processors  -  unless  a 
larger  problem  is  attacked  -  in  which  case,  linear 
speed-up  is  restored  again  up  to  the  new  optimal. 
However,  even  if  more  processors  than  optimal  is 
used,  the  scalability  only  reaches  a  plateau,  and  does 
not  show  a  dramatic  fall.  The  plateau  stems  from 
iterative  substructuring  and  is  due  to  the  flat  region 
at  the  bottom  of  the  time  vs.  substructure  curve. 

5.  The  drop  in  scalability  beyond  substructure  optimal¬ 
ity  is  traced  to  two  factors:  the  increasing  substruc¬ 
ture  to  substructure  communication  cost,  and,  the 
global  coarse  problem  communication  cost.  The  first 
penalty  is  relatively  minor,  and  can  be  minimized 
with  a  minimum  number  of  synchronization  points 
during  the  Krylov  update.  This  is  more  relevant  to 
the  GMRES.  The  second  penalty,  which  stems  from 
the  coarse  problem,  is  an  important  factor. 

6.  The  size  of  the  coarse  problem  is  the  key  driver  for 
both  parallel  scalability  as  well  as  solution  time.  The 
global  communication  required  by  the  coarse  prob¬ 
lem  drives  scalability.  The  size  of  the  coarse  problem 
drives  solution  time.  In  order  to  ensure  scalabil¬ 
ity  while  decreasing  solution  time,  a  minimal  coarse 
problem  should  be  selected. 

7.  The  coarse  problem  is  selected  by  the  domain  parti¬ 
tioning  algorithm.  Partitioning  has  special  require¬ 
ments  in  structures  -  not  just  any  will  do.  The  key 
idea  is  that  the  coarse  problem  must  ensure  null  ker¬ 
nels  in  each  substructure  while  satisfying  the  criteria 
of  minimal  selection  (conclusion  6). 

In  summary,  the  parallel  solver  developed  in  this 
study  can  solve  both  hover  and  transient  forward  flight 
response  in  a  scalable  manner.  For  example,  each  New¬ 
ton  iteration  of  a  25,  920  DOFs  problem,  could  be  solved 
in  approximately  8  —  10  seconds  on  32  processors  with 
a  residual  reduction  level  of  10-12.  A  direct  solve  on  a 
single  processor,  in  comparison,  is  not  feasible  -  requir¬ 
ing  more  than  50, 000s.  Real  blade  structures  will  contain 
millions  of  DOFs.  The  scalability  of  the  solver  must  then 
be  tested  on  100s  and  1000s  of  processors.  Actual  solver 
timings  would  be  as  important  as  scalability. 

The  next  fundamental  challenge  towards  the  appli¬ 
cation  of  this  solver  is  the  problem  of  periodic  response. 
Transient  response  is  of  little  use  in  rotary  wing  dynamics 
due  to  the  requirement  of  obtaining  periodic  solutions  in 
presence  of  undamped  modes.  The  requirement  for  a  fi¬ 
nite  element  in  time  or  non-linear  harmonic  balance  type 
method  will  be  realized  more  acutely  during  high  fidelity 
analysis.  A  time  based  domain  decomposition  mecha¬ 
nism  must  be  innovated.  This,  and  other  key  challenges 
are  summarized  below  in  way  of  conclusion. 


Future  Research  Areas 

A  suggested  list  for  future  directions  of  this  research 
is  given  below  subdivided  into  three  categories. 

Fundamental  Research 

1.  Periodic  dynamics  -  scalable  domain  decomposition 
in  time  for  temporal  boundary  value  problems. 

2.  3-D  FEM  and  multibody  dynamics  coupling. 

Applied  Research 

1.  Nodeless  elements  for  multibody  dynamics  coupling. 
Efficient,  locking-free,  hierarchical  elements. 

2.  3-D  fluid-structure  interfaces.  Innovative  delta  cou¬ 
pling  for  trim  solution  in  level  and  turning  flight. 

Application  Development 

1.  Interfacing  with  3-D  solid  geometry  and  grid  tools. 

2.  Smart  substructuring  -  corner  node  selection,  inter¬ 
face  localization,  and  nodal  reordering. 
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